Monitoring, as part of the underlying infrastructure, is indispensable for keeping production services stable. Across the discovery, location, and resolution phases of an online problem, monitoring and alerting can effectively cover "discovery" and "location", and can even resolve problems through means such as automatic fault self-healing. With it, service developers and operations staff can detect anomalies in running services promptly, and troubleshoot and resolve problems more efficiently.
One, Prometheus introduction
Typical monitoring (such as white-box monitoring) usually focuses on the internal state of the target service, for example:
- Number of requests received per unit time
- Request success rate/failure rate per unit time
- Average processing time of requests
White-box monitoring describes the internal state of the system well, but it misses the phenomena seen from an external perspective. For example, white-box monitoring can only see requests that were actually received, not requests that never arrived because of, say, a DNS failure. Black-box monitoring serves as a complement here: a probe program checks whether the target service responds successfully, giving a better picture of the system's externally visible state.
Suppose one day you need to build a monitoring system for your services, collecting the metrics reported by application instrumentation. Prometheus is a good choice for this kind of business monitoring, because it has the following advantages:
① Supports PromQL, a query language that can flexibly aggregate metric data
② Simple to deploy: a single binary is enough to run it, with no dependency on distributed storage
③ Written in Go, so its components integrate easily into projects that are also written in Go
④ Ships with a native WebUI that renders time series onto panels via PromQL
⑤ A rich ecosystem of components: Alertmanager, Pushgateway, Exporters...
The Prometheus architecture is as follows:
In the flow above, Prometheus determines the targets to scrape through the service discovery mechanism specified in its configuration file, then issues HTTP requests to a specific endpoint (the metrics path) of each scrape target (application containers and Pushgateway) and persists the returned metrics into its own TSDB, which ultimately maps the data into memory. In addition, Prometheus periodically evaluates the configured alerting rules via PromQL to decide whether to fire an alert to Alertmanager; after receiving it, Alertmanager is responsible for delivering the notification to email or an internal group chat.
Prometheus metric names may contain only ASCII letters, digits, underscores, and colons, and there is a set of naming conventions:
① Use base units (e.g. seconds rather than milliseconds)
② Prefix the metric name with its application namespace, e.g.:
process_cpu_seconds_total
http_request_duration_seconds
③ Use a suffix to describe the unit, e.g.:
http_request_duration_seconds
node_memory_usage_bytes
http_requests_total
process_cpu_seconds_total
foobar_build_info
Prometheus provides the following basic metric types:
- Counter: a metric whose sample value increases monotonically, that is, it only goes up and never down. Typically used to count things like service requests and errors.
- Gauge: a metric whose sample value can change arbitrarily, going up or down. Typically used for values such as CPU usage or memory usage.
- Histogram and Summary: used to represent statistical results of sampled data over a period of time, such as quantiles. Typically used to measure request latency or response size.
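To make the first two types concrete, here is a hand-rolled sketch of Counter and Gauge semantics rendered in the Prometheus text exposition format. The class names are illustrative, not the official prometheus_client API:

```python
# A hand-rolled sketch (not the official client library): a Counter may only
# go up, a Gauge may move in either direction.

class Counter:
    def __init__(self, name):
        self.name, self.value = name, 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount

class Gauge:
    def __init__(self, name):
        self.name, self.value = name, 0.0

    def set(self, value):  # gauges can be set to any value
        self.value = value

def render(metrics):
    """Render metrics in the Prometheus text exposition format."""
    return "\n".join(f"{m.name} {m.value}" for m in metrics)

requests_total = Counter("http_requests_total")
memory_bytes = Gauge("node_memory_usage_bytes")
requests_total.inc()
requests_total.inc(2)
memory_bytes.set(512.0)
print(render([requests_total, memory_bytes]))
# http_requests_total 3.0
# node_memory_usage_bytes 512.0
```

This is exactly the line-oriented text that a scrape endpoint returns to Prometheus.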
Prometheus storage is based on time series. First, what is a time series? A time series is a sequence of (timestamp, value) pairs: each point in time has a corresponding value. A familiar everyday example is a weather forecast such as [(14:00, 27°C), (15:00, 28°C), (16:00, 26°C)], which is a single-dimensional time series. A sequence stored as timestamp/value pairs like this is also called a vector.
To give another example, as shown in the figure above: suppose there is a metric http_requests whose job is to count the total number of requests in each time period. At this point it is the single-dimensional matrix mentioned above. When we add a dimension to the metric, the hostname, its role becomes counting the requests per hostname in each time period, and the matrix becomes a time series with multiple column vectors (one column per hostname). When multiple labels (key=value) are added to this time series, the matrix correspondingly becomes a multi-dimensional matrix.
Each unique label set corresponds to a unique vector, which can also be called a time series. Viewed at a single point in time, it is an instant vector: a series with only one timestamp and one value, for example the server's CPU load at 12:05:30 today. Viewed over a period of time, it is a range vector: a set of time-series data, for example the server's CPU load from 11:00 to 12:00 today.
Similarly, you can query the matching time series by metric name and label set:
http_requests{host="host1",service="web",code="200",env="test"}
The query result will be an instant vector:
http_requests{host="host1",service="web",code="200",env="test"} 10
http_requests{host="host2",service="web",code="200",env="test"} 0
http_requests{host="host3",service="web",code="200",env="test"} 12
If you add a time range to this condition, you can query the time series over a period:
http_requests{host="host1",service="web",code="200",env="test"}[5m]
The result will be a range vector:
http_requests{host="host1",service="web",code="200",env="test"} 0 4 6 8 10
http_requests{host="host2",service="web",code="200",env="test"} 0 0 0 0 0
http_requests{host="host3",service="web",code="200",env="test"} 0 2 5 9 12
With range vectors in hand, can we run aggregation operations over these time series? Exactly: that is what PromQL does. For example, to calculate the per-second request rate over the last 5 minutes, apply the rate function to the range vector above:
rate(http_requests{host="host1",service="web",code="200",env="test"}[5m])
Likewise, to get the increase in requests over the last 5 minutes, use the following PromQL:
increase(http_requests{host="host1",service="web",code="200",env="test"}[5m])
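Ignoring the counter-reset handling and window-edge extrapolation that the real PromQL functions perform, increase() is roughly the last sample minus the first, and rate() is that growth divided by the window length in seconds. A back-of-the-envelope sketch over host1's samples above:

```python
# Simplified versions of increase() and rate() over host1's range vector
# (samples 0 4 6 8 10 across 5 minutes). The real PromQL functions also
# handle counter resets and extrapolate to the window edges.

def increase(samples):
    # growth of a counter across the window
    return samples[-1] - samples[0]

def rate(samples, window_seconds):
    # per-second growth across the window
    return increase(samples) / window_seconds

samples = [0, 4, 6, 8, 10]      # host1's http_requests over 5 minutes
print(increase(samples))         # 10
print(rate(samples, 5 * 60))     # ~0.033 requests per second
```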
To calculate the 90th percentile in the past 10 minutes:
histogram_quantile(0.9, rate(employee_age_bucket_bucket[10m]))
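Under the hood, histogram_quantile() estimates a quantile from cumulative bucket counts: it finds the first bucket whose cumulative count reaches the target rank, then interpolates linearly inside that bucket. A simplified sketch (the bucket bounds below are made up for the demo, and the real function also handles the +Inf bucket):

```python
# Quantile estimation from cumulative histogram buckets, the way
# histogram_quantile() does it (simplified).

def histogram_quantile(q, buckets):
    """buckets: sorted list of (le_upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            # linear interpolation inside this bucket
            return lower_bound + (upper_bound - lower_bound) * \
                (rank - lower_count) / (count - lower_count)
        lower_bound, lower_count = upper_bound, count
    return float(buckets[-1][0])

# hypothetical buckets: le=20 holds 10 observations cumulatively, etc.
buckets = [(20, 10), (30, 40), (40, 80), (50, 100)]
print(histogram_quantile(0.9, buckets))  # 45.0
```

Here the 90th percentile rank is 90 out of 100 observations; it falls in the (40, 50] bucket, halfway between the cumulative counts 80 and 100, hence 45.0.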
In Prometheus, a metric (that is, a metric with a unique label set) together with one (timestamp, value) pair forms a sample. Prometheus keeps collected samples in memory and, by default, compresses the data into a block and persists it to disk every 2 hours. The more samples there are, the more memory Prometheus occupies, so in practice it is generally inadvisable to use labels with very high cardinality, such as user IPs, IDs, or URL addresses; otherwise the number of time series grows multiplicatively with the number of label values. Besides keeping the number and size of samples reasonable, you can lower storage.tsdb.min-block-duration to flush data to disk sooner, and raise the scrape interval, to control the memory Prometheus occupies.
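The cardinality warning is easy to quantify: the number of time series under one metric name is roughly the product of the distinct values of each label. With some hypothetical counts:

```python
# Why high-cardinality labels are discouraged: series count is the product of
# the distinct values of each label. The counts below are hypothetical.
import math

label_values = {"host": 100, "service": 20, "code": 5, "env": 3}
series = math.prod(label_values.values())
print(series)  # 30000 series for a single metric name

# Add a user-ID label with a million distinct values and it explodes:
print(series * 1_000_000)  # 30 billion series
```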
Prometheus specifies the targets it scrapes at runtime through the scrape_configs section of its configuration file. A target instance must expose an endpoint that Prometheus can poll. A standalone program that implements such an endpoint to supply monitoring samples to Prometheus is generally called an Exporter; Node Exporter, for example, collects hardware and operating-system metrics for Prometheus to scrape.
In a development environment, a single Prometheus instance is often enough to collect hundreds of thousands of metrics. In a production environment, however, with large numbers of application and service instances, deploying only one Prometheus instance is usually not enough. It is better to deploy multiple Prometheus instances, each scraping only a partition of the metrics. For example, using the hashmod action in Prometheus's relabel configuration, you can hash each target's address modulo the number of instances and keep only the targets whose result matches this instance's ID:
relabel_configs:
- source_labels: [__address__]
modulus: 3
target_label: __tmp_hash
action: hashmod
- source_labels: [__tmp_hash]
regex: $(PROM_ID)
action: keep
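To see what this relabel block does, here is a sketch of the sharding arithmetic: each target's __address__ is hashed, taken modulo 3, and only targets matching this instance's PROM_ID are kept. Prometheus uses MD5 internally; treat the exact hash below as an illustration rather than a guarantee of matching Prometheus's assignment:

```python
# Simulate hashmod sharding of scrape targets across 3 Prometheus instances.
# Addresses are hypothetical; the hash mimics (but may not exactly match)
# Prometheus's internal hashmod implementation.
import hashlib

def shard_of(address, modulus=3):
    digest = hashlib.md5(address.encode()).digest()
    return int.from_bytes(digest[8:], "big") % modulus

targets = ["10.0.0.1:9100", "10.0.0.2:9100", "10.0.0.3:9100", "10.0.0.4:9100"]
prom_id = 0  # this instance's $(PROM_ID)
kept = [t for t in targets if shard_of(t) == prom_id]
print(kept)  # the subset of targets this instance scrapes
```

Each target lands on exactly one shard, so the three instances together cover the full target list without overlap.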
Likewise, if we want each Prometheus to scrape the metrics of a single cluster (here, the Consul data center dc1), relabeling can do that too:
relabel_configs:
- source_labels: ["__meta_consul_dc"]
regex: "dc1"
action: keep
Two, Prometheus high availability
Now that each Prometheus holds its own slice of the data, how do we tie them together into a global view? The official answer is federation: Prometheus servers are arranged in a tree, and a Prometheus closer to the root queries the Prometheus instances at the leaf nodes and aggregates their metrics back up.
However, federation clearly does not solve everything. First, the single point of failure remains: if the root node goes down, queries become unavailable. Configuring multiple parent nodes causes data redundancy, and differences in scrape timing lead to inconsistency between copies. When there are too many leaf-node targets, the pressure on the parent node grows and it may even go down entirely. On top of that, rule configuration management is a major hassle.
Fortunately, a Prometheus clustering solution emerged in the community: Thanos. It provides a global query view that can query and aggregate data from multiple Prometheus servers, and all of this data can be obtained from a single endpoint. The Thanos Querier works as follows:
1. When Querier receives a request, it fans the request out to the relevant Sidecars and obtains time-series data from their Prometheus servers.
2. It aggregates these responses and executes the PromQL query over them; it can merge disjoint data and deduplicate the series from a Prometheus high-availability pair.
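A simplified sketch of step 2's deduplication: series that are identical except for the replica label are merged, with gaps in one replica filled by the other. The real Querier uses a more careful penalty-based algorithm; this only shows the idea:

```python
# Merge the series of an HA pair: drop the replica label, union the samples.
# The real Thanos deduplication is penalty-based; this is a simplification.

def dedup(series_list, replica_label="replica"):
    merged = {}
    for labels, samples in series_list:
        # identity of a series = its label set minus the replica label
        key = tuple(sorted((k, v) for k, v in labels.items()
                           if k != replica_label))
        merged.setdefault(key, {}).update(samples)
    return {key: dict(sorted(s.items())) for key, s in merged.items()}

# two replicas of the same series, each missing some timestamps
replica_a = ({"job": "web", "replica": "a"}, {1: 10, 2: 11})
replica_b = ({"job": "web", "replica": "b"}, {2: 11, 3: 12})
print(dedup([replica_a, replica_b]))
# the two replicas collapse into one series covering t=1..3
```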
With Thanos, horizontal scaling of Prometheus becomes easier. Beyond that, Thanos provides a reliable storage solution that can back up Prometheus's local data to remote storage. And because Thanos gives the Prometheus cluster a global view, global recording rules are no longer a problem either: the Ruler component provided by Thanos evaluates rules and fires alerts based on Thanos Querier.
Three, Prometheus storage
On the subject of storage: the high availability of Prometheus queries can be solved with horizontal scaling plus a unified query view, but how do we solve the high availability of storage? By design, Prometheus persists data locally. Local persistence is convenient, but it also brings trouble: if a node goes down or Prometheus is rescheduled to another node, the monitoring data on the original node disappears from the query interface, and local storage prevents Prometheus from scaling flexibly. For this reason, Prometheus provides Remote Read and Remote Write, which support writing the Prometheus time series to remote storage and reading the data back from it at query time.
Take M3DB as an example. M3DB is a distributed time-series database that provides a remote read/write interface for Prometheus. When a time series is written to an M3DB cluster, the data is replicated to other nodes in the cluster according to the shard and replication-factor parameters, achieving storage high availability. Besides M3DB, Prometheus currently supports InfluxDB, OpenTSDB, and others as remote-write endpoints.
Four, How Prometheus collects data
1. Pull mode
Having covered Prometheus high availability, let's look at how Prometheus discovers the targets it monitors. When the number of monitored nodes is small, the list of target hosts can be written into Prometheus's scrape configuration via static_configs. But once there are many target nodes, managing them this way becomes a real problem, and in a production environment the IPs of service instances are usually not fixed, so static configuration cannot effectively track target nodes. Prometheus's service discovery feature solves the problem of changing node state: in this mode, Prometheus watches and queries a registry for the node list and periodically scrapes metrics from the nodes. For more flexible requirements, Prometheus also supports file-based service discovery: we can fetch node lists from multiple registries ourselves, filter them according to our own needs, and write the result to a file; once Prometheus detects the file change, it dynamically replaces the monitored nodes and scrapes the new targets.
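A sketch of that file-based flow: fetch nodes from your own registries, filter them, and write them in the target-group JSON format that Prometheus's file_sd watches. The node list, labels, and file path are made up for illustration:

```python
# Generate a file_sd target file for Prometheus. The registry data below is
# hypothetical; in practice it would come from your own service registries.
import json

nodes = [
    {"addr": "10.0.0.1:8080", "env": "prod"},
    {"addr": "10.0.0.2:8080", "env": "test"},
]

# keep only production nodes, then emit the file_sd target-group format
groups = [{
    "targets": [n["addr"] for n in nodes if n["env"] == "prod"],
    "labels": {"env": "prod"},
}]

with open("targets.json", "w") as f:
    json.dump(groups, f, indent=2)

print(json.dumps(groups))
```

Pointing a file_sd_configs entry at this file lets Prometheus pick up target changes without a restart.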
2. Pushgateway
Earlier we saw that in pull mode Prometheus periodically scrapes the target nodes. If some task nodes finish and exit before they can be scraped, their monitoring data is lost. To handle this situation, Prometheus provides a tool, Pushgateway, that receives metrics actively reported by services. It is suited to short-lived batch tasks, which push their metrics to Pushgateway for temporary storage; Prometheus then scrapes Pushgateway itself, so the metrics are not lost when a task exits before being scraped. Pushgateway also helps when Prometheus and the application nodes run on heterogeneous networks or are separated by a firewall, so that the nodes cannot be scraped directly: the application nodes push metrics to a Pushgateway instance via its domain name, and Prometheus scrapes the Pushgateway node inside the same network. One more thing to watch when configuring Prometheus to scrape Pushgateway: Prometheus attaches job and instance labels to every metric, and when scraping Pushgateway, job and instance may end up being those of the Pushgateway host itself. If the metrics reported to Pushgateway already carry job and instance labels, Prometheus renames the conflicting labels to exported_job and exported_instance; if you need the pushed labels to overwrite the attached ones instead, set honor_labels: true in the Prometheus configuration.
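On the wire, a push is just an HTTP POST (or PUT) to /metrics/job/&lt;job&gt;/instance/&lt;instance&gt; whose body is the text exposition format. A sketch with a hypothetical Pushgateway address and metric; the actual send is left commented out so it runs without a live Pushgateway:

```python
# Build a push request for Pushgateway. Address, job, instance, and metric
# name are all hypothetical.
from urllib import request

PUSHGATEWAY = "http://pushgateway.example.com:9091"  # hypothetical address
job, instance = "batch_job", "worker-1"

url = f"{PUSHGATEWAY}/metrics/job/{job}/instance/{instance}"
body = b"batch_records_processed_total 1027\n"  # text exposition format

req = request.Request(url, data=body, method="POST")
# request.urlopen(req)   # uncomment when a Pushgateway is reachable
print(url)
```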
Pushgateway can replace the pull model as a metrics collection scheme, but this model brings quite a few negative effects:
- Pushgateway is designed as a cache for monitoring metrics, which means it does not actively expire the metrics reported by services. That causes no problems while a service is running, but when the service is rescheduled or destroyed, Pushgateway still retains the metrics reported by the old node. Moreover, if multiple Pushgateways run behind a load balancer, one metric may appear on several Pushgateway instances, producing duplicate data; consistent-hash routing has to be added at the proxy layer to solve this.
- In pull mode, Prometheus can easily observe the health of the monitored target instances and quickly locate faults; in push mode, since it does not actively probe the clients, it becomes blind to the health of the target instances.
Five, Prometheus' Alertmanager
Alertmanager is an alerting component split out of Prometheus. It receives alert events sent by Prometheus, then deduplicates, groups, inhibits, and routes them. In practice it can be paired with a webhook to send alert notifications to WeChat Work or DingTalk. The architecture diagram is as follows:
Six, Building a Prometheus monitoring system on Kubernetes
Although Prometheus already has an official Operator, here we write the YAML files by hand in order to learn the whole process; a handful of instances is enough to collect monitoring metrics for 200+ services and thousands of instances.
To deploy the Prometheus instances, declare a StatefulSet for Prometheus. The Pod contains three containers: Prometheus itself, the bound Thanos Sidecar, and finally a watch container that monitors changes to the Prometheus configuration file and automatically calls the Prometheus reload API when the ConfigMap is modified, completing the configuration reload. Following the data partitioning scheme described earlier, an environment variable PROM_ID is derived from the Pod name before Prometheus starts, serving as the hashmod identifier during relabeling, and POD_NAME is passed to the Thanos Sidecar as Prometheus's external_labels.replica:
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: prometheus
labels:
app: prometheus
spec:
serviceName: "prometheus"
updateStrategy:
type: RollingUpdate
replicas: 3
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
thanos-store-api: "true"
spec:
serviceAccountName: prometheus
volumes:
- name: prometheus-config
configMap:
name: prometheus-config
- name: prometheus-data
hostPath:
path: /data/prometheus
- name: prometheus-config-shared
emptyDir: {}
containers:
- name: prometheus
image: prom/prometheus:v2.11.1
args:
- --config.file=/etc/prometheus-shared/prometheus.yml
- --web.enable-lifecycle
- --storage.tsdb.path=/data/prometheus
- --storage.tsdb.retention=2w
- --storage.tsdb.min-block-duration=2h
- --storage.tsdb.max-block-duration=2h
- --web.enable-admin-api
ports:
- name: http
containerPort: 9090
volumeMounts:
- name: prometheus-config-shared
mountPath: /etc/prometheus-shared
- name: prometheus-data
mountPath: /data/prometheus
livenessProbe:
httpGet:
path: /-/healthy
port: http
- name: watch
image: watch
args: ["-v", "-t", "-p=/etc/prometheus-shared", "curl", "-X", "POST", "--fail", "-o", "-", "-sS", "http://localhost:9090/-/reload"]
volumeMounts:
- name: prometheus-config-shared
mountPath: /etc/prometheus-shared
- name: thanos
image: improbable/thanos:v0.6.0
command: ["/bin/sh", "-c"]
args:
- PROM_ID=`echo $POD_NAME| rev | cut -d '-' -f1` /bin/thanos sidecar
--prometheus.url=http://localhost:9090
--reloader.config-file=/etc/prometheus/prometheus.yml.tmpl
--reloader.config-envsubst-file=/etc/prometheus-shared/prometheus.yml
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
ports:
- name: http-sidecar
containerPort: 10902
- name: grpc
containerPort: 10901
volumeMounts:
- name: prometheus-config
mountPath: /etc/prometheus
- name: prometheus-config-shared
mountPath: /etc/prometheus-shared
Because Prometheus cannot access cluster resources in Kubernetes by default, it needs to be granted RBAC permissions:
apiVersion: v1
kind: ServiceAccount
metadata:
name: prometheus
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: prometheus
namespace: default
labels:
app: prometheus
rules:
- apiGroups: [""]
resources: ["services", "pods", "nodes", "nodes/proxy", "endpoints"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["create"]
- apiGroups: [""]
resources: ["configmaps"]
resourceNames: ["prometheus-config"]
verbs: ["get", "update", "delete"]
- nonResourceURLs: ["/metrics"]
verbs: ["get"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: prometheus
namespace: default
labels:
app: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: default
roleRef:
kind: ClusterRole
name: prometheus
apiGroup: ""
The deployment of Thanos Querier is relatively simple. At startup, specify the store parameter as dnssrv+thanos-store-gateway.default.svc so it can discover the Sidecars:
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: thanos-query
name: thanos-query
spec:
replicas: 2
selector:
matchLabels:
app: thanos-query
minReadySeconds: 5
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 1
template:
metadata:
labels:
app: thanos-query
spec:
containers:
- args:
- query
- --log.level=debug
- --query.timeout=2m
- --query.max-concurrent=20
- --query.replica-label=replica
- --query.auto-downsampling
- --store=dnssrv+thanos-store-gateway.default.svc
- --store.sd-dns-interval=30s
image: improbable/thanos:v0.6.0
name: thanos-query
ports:
- containerPort: 10902
name: http
- containerPort: 10901
name: grpc
livenessProbe:
httpGet:
path: /-/healthy
port: http
---
apiVersion: v1
kind: Service
metadata:
labels:
app: thanos-query
name: thanos-query
spec:
type: LoadBalancer
ports:
- name: http
port: 10901
targetPort: http
selector:
app: thanos-query
---
apiVersion: v1
kind: Service
metadata:
labels:
thanos-store-api: "true"
name: thanos-store-gateway
spec:
type: ClusterIP
clusterIP: None
ports:
- name: grpc
port: 10901
targetPort: grpc
selector:
thanos-store-api: "true"
Deploy Thanos Ruler:
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: thanos-rule
name: thanos-rule
spec:
replicas: 1
selector:
matchLabels:
app: thanos-rule
template:
metadata:
labels:
app: thanos-rule
spec:
containers:
- name: thanos-rule
image: improbable/thanos:v0.6.0
args:
- rule
- --web.route-prefix=/rule
- --web.external-prefix=/rule
- --log.level=debug
- --eval-interval=15s
- --rule-file=/etc/rules/thanos-rule.yml
- --query=dnssrv+thanos-query.default.svc
- --alertmanagers.url=dns+http://alertmanager.default
ports:
- containerPort: 10902
name: http
volumeMounts:
- name: thanos-rule-config
mountPath: /etc/rules
volumes:
- name: thanos-rule-config
configMap:
name: thanos-rule-config
Deploy Pushgateway:
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: pushgateway
name: pushgateway
spec:
replicas: 15
selector:
matchLabels:
app: pushgateway
template:
metadata:
labels:
app: pushgateway
spec:
containers:
- image: prom/pushgateway:v1.0.0
name: pushgateway
ports:
- containerPort: 9091
name: http
resources:
limits:
memory: 1Gi
requests:
memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
labels:
app: pushgateway
name: pushgateway
spec:
type: LoadBalancer
ports:
- name: http
port: 9091
targetPort: http
selector:
app: pushgateway
Deploy Alertmanager:
apiVersion: apps/v1
kind: Deployment
metadata:
name: alertmanager
spec:
replicas: 3
selector:
matchLabels:
app: alertmanager
template:
metadata:
name: alertmanager
labels:
app: alertmanager
spec:
containers:
- name: alertmanager
image: prom/alertmanager:latest
args:
- --web.route-prefix=/alertmanager
- --config.file=/etc/alertmanager/config.yml
- --storage.path=/alertmanager
- --cluster.listen-address=0.0.0.0:8001
- --cluster.peer=alertmanager-peers.default:8001
ports:
- name: alertmanager
containerPort: 9093
volumeMounts:
- name: alertmanager-config
mountPath: /etc/alertmanager
- name: alertmanager
mountPath: /alertmanager
volumes:
- name: alertmanager-config
configMap:
name: alertmanager-config
- name: alertmanager
emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
labels:
name: alertmanager-peers
name: alertmanager-peers
spec:
type: ClusterIP
clusterIP: None
selector:
app: alertmanager
ports:
- name: alertmanager
protocol: TCP
port: 9093
targetPort: 9093
Finally, deploy ingress:
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: pushgateway-ingress
annotations:
kubernetes.io/ingress.class: "nginx"
nginx.ingress.kubernetes.io/upstream-hash-by: "$request_uri"
nginx.ingress.kubernetes.io/ssl-redirect: "false"
spec:
rules:
- host: $(DOMAIN)
http:
paths:
- backend:
serviceName: pushgateway
servicePort: 9091
path: /metrics
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: prometheus-ingress
annotations:
kubernetes.io/ingress.class: "nginx"
spec:
rules:
- host: $(DOMAIN)
http:
paths:
- backend:
serviceName: thanos-query
servicePort: 10901
path: /
- backend:
serviceName: alertmanager
servicePort: 9093
path: /alertmanager
- backend:
serviceName: thanos-rule
servicePort: 10902
path: /rule
- backend:
serviceName: grafana
servicePort: 3000
path: /grafana
Finally, visit the Prometheus address; the status of the monitored targets is normal: