Prerequisites:
Basic familiarity with Prometheus, Alertmanager, and Grafana is assumed: you should know roughly what their configuration files do and how to navigate their web UIs. Conceptual basics are not explained in detail here.
1. System environment
- OS: CentOS 7.6
- Docker: client 18.09.7, server 18.09.7
- Kubernetes: v1.16.2
- Helm: client v2.13.1, server (Tiller) v2.13.1
Add the Helm chart repositories and update the local cache:
[root@ops1 test]# helm repo add stable http://mirror.azure.cn/kubernetes/charts/
[root@ops1 test]# helm repo list
NAME URL
local http://127.0.0.1:8879/charts
stable http://mirror.azure.cn/kubernetes/charts/
incubator http://mirror.azure.cn/kubernetes/charts-incubator/
[root@ops1 test]# helm repo update
2. Install Prometheus Operator
Search for the chart and fetch the archive; unpack it if you want to inspect the contents:
[root@ops1 test]# helm search prometheus
stable/prometheus-operator 8.12.0 0.37.0 Provides easy monitoring definitions for Kubernetes servi...
[root@ops1 test]# helm fetch stable/prometheus-operator --version 8.12.0
[root@ops1 test]# tar -zxf prometheus-operator-8.12.0.tgz
tar: prometheus-operator/Chart.yaml: implausibly old time stamp 1970-01-01 08:00:00
[root@ops1 test]# ls prometheus-operator
charts Chart.yaml CONTRIBUTING.md crds README.md requirements.lock requirements.yaml templates values.yaml
Install prometheus-operator with Helm; all of its resources are created in the monitoring namespace:
[root@ops1 test]# cat <<EOF > prometheus-operator-values.yaml
alertmanager:
  service:                  # expose alertmanager via NodePort for external test access
    nodePort: 30091
    type: NodePort
  alertmanagerSpec:
    storage:                # persistent storage; omit this block for a quick test
      volumeClaimTemplate:
        spec:
          storageClassName: prometheus-k8s
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi
grafana:
  service:                  # expose grafana via NodePort for external test access
    type: NodePort
    nodePort: 30092
prometheus:
  service:                  # expose prometheus via NodePort for external test access
    nodePort: 30090
    type: NodePort
  prometheusSpec:
    storageSpec:            # persistent storage; omit this block for a quick test
      volumeClaimTemplate:
        spec:
          storageClassName: prometheus-k8s
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi
kubeEtcd:
  service:                  # etcd exposes its metrics on port 2381 in k8s 1.16.2
    port: 2381
    targetPort: 2381
EOF
[root@ops1 test]# helm install --name prometheus-operator --version=8.12.0 -f prometheus-operator-values.yaml \
--namespace=monitoring stable/prometheus-operator
NAME: prometheus-operator
...... .......
NOTES:
The Prometheus Operator has been installed. Check its status by running:
kubectl --namespace monitoring get pods -l "release=prometheus-operator"
Visit https://github.com/coreos/prometheus-operator for instructions on how
to create & configure Alertmanager and Prometheus instances using the Operator.
[root@ops1 test]# kubectl get crd | grep monitoring
alertmanagers.monitoring.coreos.com 2020-04-08T02:59:54Z
podmonitors.monitoring.coreos.com 2020-04-08T02:59:57Z
prometheuses.monitoring.coreos.com 2020-04-08T02:59:57Z
prometheusrules.monitoring.coreos.com 2020-04-08T03:00:00Z
servicemonitors.monitoring.coreos.com 2020-04-08T03:00:02Z
thanosrulers.monitoring.coreos.com 2020-04-08T03:00:05Z
[root@ops1 prometheus-operator]# kubectl get svc -n monitoring
1. Prometheus UI: http://192.168.70.122:30090/graph#/alerts
2. Alertmanager UI: http://192.168.70.122:30091/#/alerts
3. Grafana UI (default credentials admin/prom-operator): http://192.168.70.122:30092/dashboards
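If you forget or rotate the Grafana admin password, it can be read back from the secret the chart creates. This is a sketch: the secret name below is an assumption based on the release name "prometheus-operator", and the cluster-dependent command is commented out.

```shell
# Read the Grafana admin password from the chart-managed secret (needs a cluster):
# kubectl -n monitoring get secret prometheus-operator-grafana \
#   -o jsonpath='{.data.admin-password}' | base64 -d
# The decoding step itself, shown on the chart's default password:
echo "cHJvbS1vcGVyYXRvcg==" | base64 -d   # prints prom-operator
```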
3. Configure Prometheus monitoring and alert rules
Inspect the generated Prometheus StatefulSet:
[root@ops1 test]# kubectl get sts prometheus-prometheus-operator-prometheus -o yaml
- args:
  - --config.file=/etc/prometheus/config_out/prometheus.env.yaml
  volumeMounts:
  - mountPath: /etc/prometheus/config_out
    name: config-out
    readOnly: true
- emptyDir: {}
  name: config-out
We can see that the Prometheus configuration file is a local file inside the pod (an emptyDir volume), not a ConfigMap we created ourselves. This brings us to the core concept: the Prometheus configuration is generated and managed by the Operator, as shown in the architecture diagram.
The official Prometheus Operator architecture diagram puts the Operator at the center. Acting as a controller, it creates four CRD resource types: Prometheus, ServiceMonitor, Alertmanager, and PrometheusRule, and then continuously watches and reconciles the state of these four objects.
A Prometheus resource corresponds to a Prometheus server instance, while a ServiceMonitor is an abstraction over exporters. As covered earlier, an exporter is a tool that exposes a metrics endpoint; Prometheus pulls its data from the metrics endpoints that a ServiceMonitor describes. Likewise, an Alertmanager resource abstracts an Alertmanager instance, and a PrometheusRule holds the alerting rules loaded by a Prometheus instance.
With this model, deciding what to monitor in the cluster becomes a matter of manipulating Kubernetes resource objects, which is far more convenient. Service and ServiceMonitor in the diagram are both Kubernetes resources: a ServiceMonitor matches a class of Services via a labelSelector, and a Prometheus resource in turn matches multiple ServiceMonitors via its own labelSelector.
[root@ops1 test]# kubectl get prometheus
NAME VERSION REPLICAS AGE
prometheus-operator-prometheus v2.15.2 1 21m
[root@ops1 test]# kubectl get prometheus prometheus-operator-prometheus -o yaml
spec:
  alerting:
    alertmanagers:
    - apiVersion: v2
      name: prometheus-operator-alertmanager
      namespace: monitoring
      pathPrefix: /
      port: web
  baseImage: quay.io/prometheus/prometheus
  enableAdminAPI: false
  externalUrl: http://prometheus-operator-prometheus.monitoring:9090
  listenLocal: false
  logFormat: logfmt
  logLevel: info
  paused: false
  podMonitorNamespaceSelector: {}
  podMonitorSelector:        # monitoring rules: PodMonitor CRD objects carrying this label are selected
    matchLabels:
      release: prometheus-operator
  portName: web
  replicas: 1
  retention: 10d
  routePrefix: /
  ruleNamespaceSelector: {}
  ruleSelector:              # alert rules: PrometheusRule CRD objects carrying both of these labels are selected
    matchLabels:
      app: prometheus-operator
      release: prometheus-operator
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prometheus-operator-prometheus
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchLabels:
      release: prometheus-operator
  storage:                   # the persistent storage we defined in the values file; empty if omitted
    volumeClaimTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 20Gi
        storageClassName: prometheus-k8s
  version: v2.15.2
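These selectors mean that only objects labeled release: prometheus-operator are picked up, so any ServiceMonitor or PrometheusRule we create must carry that label. A minimal sketch of the equality-based matching, with the cluster-only listing command commented out:

```shell
# List objects the selectors above would pick up (needs a cluster):
# kubectl get servicemonitors,prometheusrules -A -l release=prometheus-operator
# The matching itself is plain label equality, e.g.:
labels="app=tomcat release=prometheus-operator"
match=skipped
case " $labels " in
  *" release=prometheus-operator "*) match=matched ;;
esac
echo "$match"   # prints matched
```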
Let's configure service monitoring first.
Create two tomcat deployments that expose a metrics endpoint; any other service with a metrics endpoint works just as well.
[root@ops1 test]# kubectl create ns tomcat
namespace/tomcat created
[root@ops1 test]# cat <<EOF > tomcat-test1.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tomcat-test1
  namespace: tomcat
  labels:
    k8s.eip.work/layer: svc
    k8s.eip.work/name: tomcat-test1
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s.eip.work/layer: svc
      k8s.eip.work/name: tomcat-test1
  template:
    metadata:
      labels:
        k8s.eip.work/layer: svc
        k8s.eip.work/name: tomcat-test1
    spec:
      containers:
      - name: tomcat-test1
        image: 'registry.cn-beijing.aliyuncs.com/wangzt/k8s/tomcat:v1.3'
---
apiVersion: v1
kind: Service
metadata:
  name: tomcat-test1
  namespace: tomcat
  labels:
    k8s.eip.work/layer: svc
    k8s.eip.work/name: tomcat-test1
spec:
  selector:
    k8s.eip.work/layer: svc
    k8s.eip.work/name: tomcat-test1
  type: NodePort
  ports:
  - name: tomcat-web
    port: 80
    targetPort: 8080
  - name: metrics
    port: 9090
    targetPort: 9090
EOF
[root@ops1 test]# kubectl apply -f tomcat-test1.yaml
[root@ops1 test]# cp tomcat-test1.yaml tomcat-test2.yaml && sed -i 's&tomcat-test1&tomcat-test2&' tomcat-test2.yaml && \
sed -i 's&v1.3&v0.8&' tomcat-test2.yaml && kubectl apply -f tomcat-test2.yaml
Now tomcat-test1 is healthy while tomcat-test2 (image v0.8) is broken, which makes for a handy comparison:
[root@ops1 test]# curl http://10.100.33.236:9090/metrics
# HELP tomcat_bytesreceived_total Tomcat global bytesReceived
# TYPE tomcat_bytesreceived_total counter
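The exporter output is the plain-text Prometheus exposition format. A sketch of filtering it by metric prefix, using a local sample in the same shape as the curl output above (the label values are illustrative):

```shell
# Sample in the exposition format returned by the tomcat exporter above:
cat <<'EOF' > metrics-sample.txt
# HELP tomcat_bytesreceived_total Tomcat global bytesReceived
# TYPE tomcat_bytesreceived_total counter
tomcat_bytesreceived_total{port="8080",protocol="http-nio"} 0.0
EOF
# Count the actual samples (lines not starting with '#'):
grep -c '^tomcat_' metrics-sample.txt   # prints 1
```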
Monitor services in the tomcat namespace that carry the label k8s.eip.work/layer: svc:
[root@ops1 test]# cat <<EOF > prometheus-serviceMonitorTomcatTest.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor             # handled by the ServiceMonitor CRD
metadata:
  labels:
    app: prometheus-operator-tomcat-test
    chart: prometheus-operator-8.12.3
    release: prometheus-operator # Prometheus selects ServiceMonitors by this label
  name: prometheus-operator-tomcat-test
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s                # scrape every 30s
    path: /metrics               # path on the target service
    port: metrics                # port name on the target service
  jobLabel: k8s.eip.work/layer
  namespaceSelector:             # match services in a given namespace; use "any: true" to match all namespaces
    matchNames:
    - tomcat
  selector:                      # labels of the Services to match; with matchLabels all listed labels must match, with matchExpressions a service matching any expression is selected
    matchLabels:
      k8s.eip.work/layer: svc    # match services carrying this label
EOF
[root@ops1 test]# kubectl apply -f prometheus-serviceMonitorTomcatTest.yaml
servicemonitor.monitoring.coreos.com/prometheus-operator-tomcat-test created
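To confirm Prometheus actually discovered the endpoints, query its targets API (the IP and NodePort are the values from this setup). A trimmed sample of the response shape, with illustrative scrapePool names:

```shell
# With a cluster: curl -s http://192.168.70.122:30090/api/v1/targets
# Trimmed sample of the response; a healthy target reports "health":"up":
cat <<'EOF' > targets-sample.json
{"data":{"activeTargets":[
  {"scrapePool":"monitoring/prometheus-operator-tomcat-test/0","health":"up"},
  {"scrapePool":"monitoring/prometheus-operator-tomcat-test/0","health":"down"}
]}}
EOF
grep -o '"health":"[a-z]*"' targets-sample.json
```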
The Prometheus targets page now shows one working tomcat and one failing one.
4. Configure alert trigger rules
Next, configure an alert rule that fires when more than half of a job's instances are down.
The rule adds the label alertManagerRule: node; Alertmanager will later route alerts based on this label.
[root@ops1 test]# cat <<'EOF' > prometheus-operator-tomcat-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    chart: prometheus-operator-8.12.3
    heritage: Tiller
    app: prometheus-operator
    release: prometheus-operator
  name: prometheus-operator-tomcat-test.rules
  namespace: monitoring
spec:
  groups:
  - name: tomcat-test.rules
    rules:
    - alert: tomcat-down
      expr: count( up{namespace="tomcat"} == 0 )by (job) > ( count(up{namespace="tomcat"})by (job) / 2 - 1)
      for: 2m
      labels:
        alertManagerRule: node   # note this label: Alertmanager routes the alert on it
      annotations:
        description: "{{$labels.instance}}: Tomcat Service Is Down"
EOF
[root@ops1 test]# kubectl apply -f prometheus-operator-tomcat-rules.yaml
prometheusrule.monitoring.coreos.com/prometheus-operator-tomcat-test.rules created
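The expr reads: per job, fire when the number of down instances exceeds half the total minus one. With the two tomcat deployments here, a single down instance already crosses the threshold. The arithmetic can be checked directly (note shell uses integer division, while PromQL divides as floats; for 2 instances the result is the same):

```shell
# count(up == 0) by (job)  >  count(up) by (job) / 2 - 1, with 2 instances, 1 down:
TOTAL=2
DOWN=1
THRESHOLD=$(( TOTAL / 2 - 1 ))   # 0 for two instances
if [ "$DOWN" -gt "$THRESHOLD" ]; then
  echo "alert fires"
fi
```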
Enter the container to see that the rule file has been generated:
[root@ops1 test]# kubectl exec -it prometheus-prometheus-operator-prometheus-0 /bin/sh -n monitoring
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-prometheus-operator-prometheus-0 -n monitoring' to see all of the containers in this pod.
/prometheus $ ls /etc/prometheus/rules/prometheus-prometheus-operator-prometheus-rulefiles-0/
The new alert then shows up under Pending at http://192.168.70.122:30090/alerts.
The page lists the alert rule we just defined together with its state. An alert passes through three states during its lifecycle:
- inactive: the alert is neither pending nor firing
- pending: the condition is met but has not yet held for the configured "for" duration
- firing: the condition has held for longer than the "for" duration
Once the "for" duration elapses (2 minutes here), the state changes to Firing, the alert is triggered, and it is sent to Alertmanager, where you can see it in the UI.
Next we configure email and DingTalk notifications.
5. Alertmanager notifications
First look at how Alertmanager is configured: its configuration file is mounted from a Secret.
[root@ops1 test]# kubectl get sts alertmanager-prometheus-operator-alertmanager -o yaml
- args:
  - --config.file=/etc/alertmanager/config/alertmanager.yaml
  volumeMounts:
  - mountPath: /etc/alertmanager/config
    name: config-volume
volumes:
- name: config-volume
  secret:
    defaultMode: 420
    secretName: alertmanager-prometheus-operator-alertmanager   # the secret holding the config file
[root@ops1 test]# kubectl get secret alertmanager-prometheus-operator-alertmanager -o yaml > alertmanager-cm-old.yaml
apiVersion: v1
data:
alertmanager.yaml: Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KcmVjZWl2ZXJzOgotIG5hbWU6ICJudWxsIgpyb3V0ZToKICBncm91cF9ieToKICAtIGpvYgogIGdyb3VwX2ludGVydmFsOiA1bQogIGdyb3VwX3dhaXQ6IDMwcwogIHJlY2VpdmVyOiAibnVsbCIKICByZXBlYXRfaW50ZXJ2YWw6IDEyaAogIHJvdXRlczoKICAtIG1hdGNoOgogICAgICBhbGVydG5hbWU6IFdhdGNoZG9nCiAgICByZWNlaXZlcjogIm51bGwiCg==
[root@ops1 test]# echo "Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KcmVjZWl2ZXJzOgotIG5hbWU6ICJudWxsIgpyb3V0ZToKICBncm91cF9ieToKICAtIGpvYgogIGdyb3VwX2ludGVydmFsOiA1bQogIGdyb3VwX3dhaXQ6IDMwcwogIHJlY2VpdmVyOiAibnVsbCIKICByZXBlYXRfaW50ZXJ2YWw6IDEyaAogIHJvdXRlczoKICAtIG1hdGNoOgogICAgICBhbGVydG5hbWU6IFdhdGNoZG9nCiAgICByZWNlaXZlcjogIm51bGwiCg==" | base64 -d
global:
  resolve_timeout: 5m
receivers:
- name: "null"
route:
  group_by:
  - job
  group_interval: 5m
  group_wait: 30s
  receiver: "null"
  repeat_interval: 12h
  routes:
  - match:
      alertname: Watchdog
    receiver: "null"
Add email alerting (alertmanager.yaml)
As we just saw, the default config contains very little, so we write a new one from scratch with two notification channels: email and a DingTalk webhook.
[root@ops1 test]# cat <<EOF > alertmanager.yaml
global:
  # how long to wait before declaring a silent alert resolved
  resolve_timeout: 5m
  # SMTP settings for email notifications
  smtp_smarthost: 'smtp.exmail.qq.com:25'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: "${mima}"
  smtp_hello: '[email protected]'
  smtp_require_tls: false
# root route for all incoming alerts; defines the dispatch policy
route:
  # regroup incoming alerts by these labels; e.g. alerts sharing cluster=A and
  # alertname=LatencyHigh are aggregated into a single group
  group_by: ['alertname', 'cluster']
  # after a new group is created, wait at least group_wait before the first
  # notification, so multiple alerts for the same group can be sent together
  group_wait: 30s
  # after the first notification, wait group_interval before notifying about
  # new alerts added to the group
  group_interval: 30s
  # if an alert has already been sent successfully, wait repeat_interval
  # before resending it
  repeat_interval: 2m
  # default receiver: used when an alert matches no sub-route
  receiver: default
  # sub-routes inherit all properties above and may override them
  routes:
  - receiver: email
    group_wait: 10s
    match:
      alertManagerRule: node   # route alerts carrying this label here
#  - receiver: webhook
#    match:
#      alertManagerRule: node  # route alerts carrying this label here
receivers:
- name: 'default'
  email_configs:
  - to: '[email protected]'
    send_resolved: true
- name: 'email'
  email_configs:
  - to: '[email protected]'
    send_resolved: true
  webhook_configs:
  - url: 'http://dingtalk-hook:5000'
    send_resolved: true
- name: 'webhook'
  webhook_configs:
  - url: 'http://dingtalk-hook:5000'
    send_resolved: true
EOF
[root@ops1 test]# kubectl delete secret alertmanager-prometheus-operator-alertmanager -n monitoring
secret "alertmanager-prometheus-operator-alertmanager" deleted
[root@ops1 test]# kubectl create secret generic alertmanager-prometheus-operator-alertmanager --from-file=alertmanager.yaml -n monitoring
secret/alertmanager-prometheus-operator-alertmanager created
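To verify the new configuration actually landed in the secret, decode it back out and diff against the file you wrote. The cluster-dependent command is commented; the encode/decode round-trip is sketched on a local sample:

```shell
# With a cluster:
# kubectl -n monitoring get secret alertmanager-prometheus-operator-alertmanager \
#   -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d | diff - alertmanager.yaml
# The round-trip itself, on a minimal sample file:
cat <<'EOF' > am-sample.yaml
route:
  receiver: default
EOF
base64 < am-sample.yaml | base64 -d | diff - am-sample.yaml && echo "round-trip ok"
```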
Wait a minute or so for the config to take effect, then open the Alertmanager status page at http://192.168.70.122:30091/#/status and confirm the new config is loaded.
Now watch the mailbox for alert emails.
Add DingTalk alerting
The config above also defines a second receiver, name: 'webhook', which notifies through an HTTP URL; here it points at a DingTalk robot. Test the robot first:
curl "https://oapi.dingtalk.com/robot/send?access_token=${token}" \
  -H 'Content-Type: application/json' \
  -d '{"msgtype": "text", "text": {"content": "test message"}}'
[root@ops1 test]# kubectl create secret generic dingtalk-secret --from-literal=token=$token -n monitoring   # store the robot token in a secret
secret/dingtalk-secret created
[root@ops1 test]# cat <<EOF > dingtalk-hook.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dingtalk-hook
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dingtalk-hook
  template:
    metadata:
      labels:
        app: dingtalk-hook
    spec:
      containers:
      - name: dingtalk-hook
        image: registry.cn-beijing.aliyuncs.com/wangzt/k8s/dingtalk-hook:0.1
        # based on cnych/alertmanager-dingtalk-hook:v0.2, modified to drop the json wrapping
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 5000
          name: http
        env:
        - name: ROBOT_TOKEN
          valueFrom:
            secretKeyRef:
              name: dingtalk-secret
              key: token
        resources:
          requests:
            cpu: 50m
            memory: 100Mi
          limits:
            cpu: 50m
            memory: 100Mi
---
apiVersion: v1
kind: Service
metadata:
  name: dingtalk-hook
  namespace: monitoring
spec:
  selector:
    app: dingtalk-hook
  ports:
  - name: hook
    port: 5000
    targetPort: http
EOF
[root@ops1 test]# kubectl apply -f dingtalk-hook.yaml
deployment.apps/dingtalk-hook created
service/dingtalk-hook created
Alert notifications should now arrive in DingTalk.
Add the etcd client certificates to Prometheus
[root@ops1 prometheus]# kubectl -n monitoring create secret generic etcd-certs --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.crt --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.key --from-file=/etc/kubernetes/pki/etcd/ca.crt
secret/etcd-certs created
[root@ops1 test]# kubectl exec -it prometheus-prometheus-operator-prometheus-0 /bin/sh -n monitoring
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-prometheus-operator-prometheus-0 -n monitoring' to see all of the containers in this pod.
/prometheus $ ls /etc/prometheus/secrets/etcd-certs/
ca.crt healthcheck-client.crt healthcheck-client.key
6. Collect Java metrics with Prometheus
The config.yaml below is a jmx_exporter mapping file that rewrites Tomcat MBean names into Prometheus metrics (the heredoc is quoted so that $1/$2/$3 are written literally instead of being expanded by the shell):
[root@dev3_worker bin]# cat <<'EOF' > config.yaml
---
lowercaseOutputLabelNames: true
lowercaseOutputName: true
rules:
- pattern: 'Catalina<type=GlobalRequestProcessor, name=\"(\w+-\w+)-(\d+)\"><>(\w+):'
  name: tomcat_$3_total
  labels:
    port: "$2"
    protocol: "$1"
  help: Tomcat global $3
  type: COUNTER
- pattern: 'Catalina<j2eeType=Servlet, WebModule=//([-a-zA-Z0-9+&@#/%?=~_|!:.,;]*[-a-zA-Z0-9+&@#/%=~_|]), name=([-a-zA-Z0-9+/$%~_-|!.]*), J2EEApplication=none, J2EEServer=none><>(requestCount|maxTime|processingTime|errorCount):'
  name: tomcat_servlet_$3_total
  labels:
    module: "$1"
    servlet: "$2"
  help: Tomcat servlet $3 total
  type: COUNTER
- pattern: 'Catalina<type=ThreadPool, name="(\w+-\w+)-(\d+)"><>(currentThreadCount|currentThreadsBusy|keepAliveCount|pollerThreadCount|connectionCount):'
  name: tomcat_threadpool_$3
  labels:
    port: "$2"
    protocol: "$1"
  help: Tomcat threadpool $3
  type: GAUGE
- pattern: 'Catalina<type=Manager, host=([-a-zA-Z0-9+&@#/%?=~_|!:.,;]*[-a-zA-Z0-9+&@#/%=~_|]), context=([-a-zA-Z0-9+/$%~_-|!.]*)><>(processingTime|sessionCounter|rejectedSessions|expiredSessions):'
  name: tomcat_session_$3_total
  labels:
    context: "$2"
    host: "$1"
  help: Tomcat session $3 total
  type: COUNTER
EOF
Collect tomcat data
Jar applications
Start the application with the agent attached, as described on the jmx_exporter GitHub page:
java -javaagent:./jmx_prometheus_javaagent-0.12.0.jar=8080:config.yaml -jar yourJar.jar
Tomcat war applications
Copy the jmx_prometheus_javaagent jar and config.yaml into $TOMCAT_HOME/bin, then edit catalina.sh: find JAVA_OPTS and add the agent flag below.
If you run multiple tomcats, put jmx_prometheus_javaagent and config.yaml in a fixed directory and use absolute paths in $TOMCAT_HOME/bin/catalina.sh.
# edit bin/catalina.sh and add: JAVA_OPTS="-javaagent:bin/jmx_prometheus_javaagent-0.12.0.jar=39081:bin/config.yaml"
A war application may additionally need:
-Djava.util.logging.config.file=/path/to/logging.properties
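Rather than editing catalina.sh directly, Tomcat also sources bin/setenv.sh on startup if it exists, which keeps the change separate from the stock scripts. A sketch with the same agent and config names as above; /opt/tomcat is a hypothetical install directory:

```shell
# Create $TOMCAT_HOME/bin/setenv.sh; catalina.sh sources it automatically on startup.
# /opt/tomcat below is a hypothetical install path - use your own absolute paths.
cat <<'EOF' > setenv.sh
JAVA_OPTS="$JAVA_OPTS -javaagent:/opt/tomcat/bin/jmx_prometheus_javaagent-0.12.0.jar=39081:/opt/tomcat/bin/config.yaml"
export JAVA_OPTS
EOF
grep -c javaagent setenv.sh   # prints 1
```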
7. Fix services Prometheus cannot scrape by default
1. The prometheus-operator-kube-etcd and prometheus-operator-kube-proxy targets are down
1. prometheus-operator-kube-etcd: inspecting the config shows that by default etcd serves metrics only on 127.0.0.1, port 2381
curl http://127.0.0.1:2381/metrics | head
Edit /etc/kubernetes/manifests/etcd.yaml and change
- --listen-metrics-urls=http://127.0.0.1:2381
so that it listens on all addresses:
- --listen-metrics-urls=http://0.0.0.0:2381
Then run kubectl edit svc prometheus-operator-kube-etcd -n kube-system and set the service port to 2381.
2. prometheus-operator-kube-proxy is down
kube-proxy likewise listens on 127.0.0.1 only:
[root@ops1 manifests]# kubectl get svc prometheus-operator-kube-proxy -o yaml -n kube-system
[root@ops1 manifests]# kubectl get ds kube-proxy -o yaml -n kube-system
kubectl edit cm kube-proxy -n kube-system
and change metricsBindAddress so it listens on all addresses.
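The kube-proxy change can be scripted: the ConfigMap embeds a KubeProxyConfiguration whose metricsBindAddress defaults to 127.0.0.1:10249 (it may also appear as an empty string, which means the same default); set it to 0.0.0.0:10249 and restart the daemonset. The cluster-dependent commands are commented; the substitution itself is shown on the relevant line:

```shell
# With a cluster:
# kubectl -n kube-system get cm kube-proxy -o yaml | \
#   sed 's/metricsBindAddress: 127.0.0.1:10249/metricsBindAddress: 0.0.0.0:10249/' | \
#   kubectl apply -f -
# kubectl -n kube-system rollout restart ds kube-proxy
# The sed substitution on the relevant config line:
echo 'metricsBindAddress: 127.0.0.1:10249' | sed 's/127\.0\.0\.1/0.0.0.0/'
```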
8. Remove the Prometheus stack
Delete the Helm release; if the CRDs are no longer needed, delete them as well:
helm del --purge prometheus-operator
# remove the CRDs
kubectl delete crd prometheuses.monitoring.coreos.com
kubectl delete crd prometheusrules.monitoring.coreos.com
kubectl delete crd servicemonitors.monitoring.coreos.com
kubectl delete crd podmonitors.monitoring.coreos.com
kubectl delete crd alertmanagers.monitoring.coreos.com
kubectl delete crd thanosrulers.monitoring.coreos.com
kubectl get crd | grep monitoring