Basic architecture
Prometheus, originally developed at SoundCloud, is an open-source monitoring, alerting, and time-series database system written in Go.
Its basic principle is to periodically scrape the state of monitored components over HTTP. Any component can be monitored as long as it exposes a suitable HTTP endpoint; no SDK or other integration step is required. This makes it a good fit for monitoring virtualized environments such as VMs, Docker, and Kubernetes.
The main component functions of Prometheus are as follows:
- Prometheus Server: periodically pulls data from statically configured targets or from targets found via service discovery (mainly DNS, Consul, Kubernetes, Mesos, etc.).
- Exporter: exposes metrics for the Prometheus server to scrape. Different kinds of data are exposed by different exporters; for example, host metrics come from node_exporter and MySQL metrics from mysqld_exporter.
- Pushgateway: besides pulling from exporters, Prometheus also supports a push path: a service can push its metrics to the Pushgateway first, and the server then pulls them from there.
- Alertmanager: implements Prometheus's alerting functionality.
- Web UI: dashboards are mainly provided by Grafana.
In practice the basic flow is:
each service exposes (or pushes) its metrics through a matching component (such as the exporters described below) --> the Prometheus server scrapes and stores the data on a schedule --> Grafana is configured to display the data, and alert rules are configured to trigger alarms.
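As a concrete illustration of the pull flow above, a minimal Prometheus scrape configuration might look like the sketch below (the job name and target address are placeholders, not values from this article's setup):

```yaml
# prometheus.yml (sketch) -- job name and target address are assumptions
scrape_configs:
  - job_name: "my-service"      # becomes the job label on scraped metrics
    scrape_interval: 15s        # how often Prometheus pulls
    metrics_path: /metrics      # the HTTP endpoint the exporter exposes
    static_configs:
      - targets: ["192.168.1.100:8000"]
```

Every series collected from this target is stored with `job="my-service"` plus an `instance` label derived from the target address.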
Deploying the Prometheus platform with Helm
Use helm to deploy kube-prometheus-stack
Helm chart repository: https://prometheus-community.github.io/helm-charts
GitHub: https://github.com/prometheus-community/helm-charts
First, install the helm CLI on the server; installation itself is widely documented, so it is not covered here. Then install Prometheus with helm:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install [RELEASE_NAME] prometheus-community/kube-prometheus-stack
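Chart defaults can be overridden at install time with a values file. A small hedged example (the retention value is purely illustrative; the key path comes from the chart's values.yaml):

```yaml
# my-values.yaml -- override a kube-prometheus-stack default (illustrative)
prometheus:
  prometheusSpec:
    retention: 15d   # keep metrics for 15 days instead of the chart default
```

It is applied with `helm install [RELEASE_NAME] prometheus-community/kube-prometheus-stack -f my-values.yaml`; the same pattern is used later in this article to upgrade the release.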
Exporter
To collect monitoring data from a target, a collection component must first be installed there; this component is called an exporter. Many such exporters are available on the prometheus.io official website, including the official exporter list.
How does the collected data reach Prometheus?
An exporter exposes an HTTP endpoint, and Prometheus operates in pull mode: it periodically scrapes the monitored component's data over HTTP.
Prometheus also supports a push mode: data can be pushed to a Pushgateway, and Prometheus then pulls it from the Pushgateway.
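For instance, a short-lived batch job can hand a metric to the Pushgateway with a plain HTTP request. A minimal sketch, assuming a Pushgateway at localhost:9091 (the metric name, value, and job label are made up for illustration):

```shell
# A metric in Prometheus text exposition format (assumed name/value).
PAYLOAD='batch_job_duration_seconds 12.7'
echo "$PAYLOAD"
# Push it under job "batch_demo"; uncomment once a Pushgateway is reachable:
# echo "$PAYLOAD" | curl --data-binary @- http://localhost:9091/metrics/job/batch_demo
```

The Pushgateway then holds the metric until Prometheus scrapes it on its normal schedule.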
Access collection components in golang applications
kratos framework
An example of wiring the Prometheus collection components into the microservice framework kratos (see the kratos official tutorial):
package main

import (
	"context"
	"fmt"
	"log"

	prom "github.com/go-kratos/kratos/contrib/metrics/prometheus/v2"
	"github.com/go-kratos/kratos/v2/middleware/metrics"
	"github.com/prometheus/client_golang/prometheus/promhttp"

	"github.com/go-kratos/examples/helloworld/helloworld"
	"github.com/go-kratos/kratos/v2"
	"github.com/go-kratos/kratos/v2/transport/grpc"
	"github.com/go-kratos/kratos/v2/transport/http"
	"github.com/prometheus/client_golang/prometheus"
)

// go build -ldflags "-X main.Version=x.y.z"
var (
	// Name is the name of the compiled software.
	Name = "metrics"
	// Version is the version of the compiled software.
	// Version = "v1.0.0"

	_metricSeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Namespace: "server",
		Subsystem: "requests",
		Name:      "duration_sec",
		Help:      "server requests duration(sec).",
		Buckets:   []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.250, 0.5, 1},
	}, []string{"kind", "operation"})

	_metricRequests = prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: "client",
		Subsystem: "requests",
		Name:      "code_total",
		Help:      "The total number of processed requests",
	}, []string{"kind", "operation", "code", "reason"})
)

// server is used to implement helloworld.GreeterServer.
type server struct {
	helloworld.UnimplementedGreeterServer
}

// SayHello implements helloworld.GreeterServer
func (s *server) SayHello(ctx context.Context, in *helloworld.HelloRequest) (*helloworld.HelloReply, error) {
	return &helloworld.HelloReply{
		Message: fmt.Sprintf("Hello %+v", in.Name)}, nil
}

func init() {
	prometheus.MustRegister(_metricSeconds, _metricRequests)
}

func main() {
	grpcSrv := grpc.NewServer(
		grpc.Address(":9000"),
		grpc.Middleware(
			metrics.Server(
				metrics.WithSeconds(prom.NewHistogram(_metricSeconds)),
				metrics.WithRequests(prom.NewCounter(_metricRequests)),
			),
		),
	)
	httpSrv := http.NewServer(
		http.Address(":8000"),
		http.Middleware(
			metrics.Server(
				metrics.WithSeconds(prom.NewHistogram(_metricSeconds)),
				metrics.WithRequests(prom.NewCounter(_metricRequests)),
			),
		),
	)
	httpSrv.Handle("/metrics", promhttp.Handler())

	s := &server{}
	helloworld.RegisterGreeterServer(grpcSrv, s)
	helloworld.RegisterGreeterHTTPServer(httpSrv, s)

	app := kratos.New(
		kratos.Name(Name),
		kratos.Server(
			httpSrv,
			grpcSrv,
		),
	)
	if err := app.Run(); err != nil {
		log.Fatal(err)
	}
}
This exposes an HTTP endpoint at http://127.0.0.1:8000/metrics, from which Prometheus can pull monitoring data.
Gin framework
An example of wiring the Prometheus collection components into the lightweight HTTP framework Gin:
package main

import (
	"strconv"
	"time"

	"github.com/gin-gonic/gin"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	handler = promhttp.Handler()

	_metricSeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Namespace: "server",
		Subsystem: "requests",
		Name:      "duration_sec",
		Help:      "server requests duration(sec).",
		Buckets:   []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.250, 0.5, 1},
	}, []string{"method", "path"})

	_metricRequests = prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: "client",
		Subsystem: "requests",
		Name:      "code_total",
		Help:      "The total number of processed requests",
	}, []string{"method", "path", "code"})
)

func init() {
	prometheus.MustRegister(_metricSeconds, _metricRequests)
}

// HandlerMetrics serves the /metrics endpoint.
func HandlerMetrics() func(c *gin.Context) {
	return func(c *gin.Context) {
		handler.ServeHTTP(c.Writer, c.Request)
	}
}

// WithProm is a middleware that records request durations and status codes.
func WithProm() gin.HandlerFunc {
	return func(c *gin.Context) {
		var (
			method string
			path   string
			code   int
		)
		startTime := time.Now()
		method = c.Request.Method
		path = c.Request.URL.Path
		c.Next()
		code = c.Writer.Status()
		_metricSeconds.WithLabelValues(method, path).Observe(time.Since(startTime).Seconds())
		_metricRequests.WithLabelValues(method, path, strconv.Itoa(code)).Inc()
	}
}

func main() {
	r := gin.Default()
	r.Use(WithProm())
	r.GET("/ping", func(c *gin.Context) {
		c.JSON(200, gin.H{
			"message": "pong",
		})
	})
	r.GET("/metrics", HandlerMetrics())
	r.Run() // listen and serve on 0.0.0.0:8080
}
This exposes an HTTP endpoint at http://127.0.0.1:8080/metrics, from which Prometheus can pull monitoring data.
Scraping data sources outside the cluster
helm
Background: kube-prometheus-stack is already deployed in an existing K8s cluster
to monitor servers and services. Nodes, pods, and other in-cluster components are already connected to Prometheus; now application services running outside the K8s cluster need to be connected as well.
When Prometheus scrapes data from outside the K8s cluster, the following approaches are available:
- ServiceMonitor
- Additional Scrape Configuration
ServiceMonitor
ServiceMonitor is a CRD that declares which service endpoints Prometheus should scrape and at what interval.
To monitor services outside the cluster via a ServiceMonitor, you need to configure a Service, an Endpoints object, and the ServiceMonitor itself.
Suppose a backend service is already deployed at 192.168.1.100:8000 and exposes its monitoring metrics at /metrics. To connect it to Prometheus, proceed as follows:
At the command line, run:
$ touch external-application.yaml
$ vim external-application.yaml
Then paste the following YAML into the file:
---
apiVersion: v1
kind: Service
metadata:
  name: external-application-exporter
  namespace: monitoring
  labels:
    app: external-application-exporter
    app.kubernetes.io/name: application-exporter
spec:
  type: ClusterIP
  ports:
    - name: metrics
      port: 9101
      protocol: TCP
      targetPort: 9101
---
apiVersion: v1
kind: Endpoints
metadata:
  name: external-application-exporter
  namespace: monitoring
  labels:
    app: external-application-exporter
    app.kubernetes.io/name: application-exporter
subsets:
  - addresses:
      - ip: 192.168.1.100 # external resource address
    ports:
      - name: metrics
        port: 8000
  - addresses:
      - ip: 192.168.1.100 # external resource address 2
    ports:
      - name: metrics
        port: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: external-application-exporter
  namespace: monitoring
  labels:
    app: external-application-exporter
    release: prometheus
spec:
  selector:
    matchLabels: # Service selector
      app: external-application-exporter
  namespaceSelector: # namespace selector
    matchNames:
      - monitoring
  endpoints:
    - port: metrics # port name to scrape (as defined in the Service)
      interval: 10s # scrape interval, configure to your needs; 10s here
      path: /metrics # default path /metrics
After saving the file, run:
kubectl apply -f external-application.yaml
Then open the Prometheus console and go to the Targets page; the new external-application-exporter target is displayed:
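To double-check from the Prometheus side, you can also query the up series for the new target in the console. With a ServiceMonitor, the job label normally defaults to the Service name, so the label value below is an assumption based on the manifests above:

```
up{job="external-application-exporter"}
```

A value of 1 means the target is being scraped successfully; 0 means the scrape is failing.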
Additional Scrape Configuration
Besides HTTP services exposed via IP and port, I also have HTTPS services deployed on other servers and reachable through domain names, and I want to connect them in the same way.
The first idea was to modify the Endpoints object, but the Kubernetes official documentation shows that Endpoints only accepts IP addresses and provides no way to specify the HTTPS protocol.
So let's try another approach.
The first method
Checking the official documentation on Prometheus scrape configuration shows that the relevant keyword is scrape_config.
Our Prometheus was deployed with helm via kube-prometheus-stack, so let's look at the chart's values.yaml to see where this is configured.
Enter the command:
$ cat values.yaml | grep -C 20 scrape_config
The output is as follows:
As the comments explain, kube-prometheus-stack configures extra scrape targets through additionalScrapeConfigs.
So I wrote a configuration file to update the Prometheus release that helm had deployed.
$ touch prometheus.yaml
$ vim prometheus.yaml
Write the following content:
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: external-application-exporter-https
        scrape_interval: 10s
        scrape_timeout: 10s
        metrics_path: /metrics
        scheme: https
        tls_config:
          insecure_skip_verify: true
        static_configs:
          - targets: ["www.baidu.com:443"]
Finally, update the release:
$ helm upgrade -n monitoring -f prometheus.yaml prometheus kube-prometheus-stack-40.0.0.tgz
Here prometheus.yaml is the values file written above, and kube-prometheus-stack-40.0.0.tgz is the chart that was pulled locally with helm when Prometheus was first deployed.
The newly added data source now appears in the Targets page of the Prometheus console.
We could stop here, but there is a drawback: every time a new domain name is added for monitoring, the helm release has to be upgraded again, which is not particularly convenient.
The second method
Browsing the prometheus-operator source repository, its documentation describes hot-reloading of scrape configuration. In short: additional scrape targets are controlled through a Secret, and when the Secret's content is modified, Prometheus's scrape configuration is hot-reloaded.
Step 1: create the prometheus-additional.yaml file
$ touch prometheus-additional.yaml
$ vim prometheus-additional.yaml
Content of prometheus-additional.yaml:
- job_name: external-application-exporter-https
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  tls_config:
    insecure_skip_verify: true
  static_configs:
    - targets: ["www.baidu.com:443"]
Step 2: generate the Secret
Generate the manifest used to create the Secret:
$ kubectl create secret generic additional-scrape-configs --from-file=prometheus-additional.yaml --dry-run=client -oyaml > additional-scrape-configs.yaml
$ cat additional-scrape-configs.yaml
The generated additional-scrape-configs.yaml looks like this:
apiVersion: v1
data:
  prometheus-additional.yaml: LSBqb2JfbmFtZTogZXh0ZXJuYWwtYXBwbGljYXRpb24tZXhwb3J0ZXItaHR0cHMKICBzY3JhcGVfaW50ZXJ2YWw6IDEwcwogIHNjcmFwZV90aW1lb3V0OiAxMHMKICBtZXRyaWNzX3BhdGg6IC9tZXRyaWNzCiAgc2NoZW1lOiBodHRwcwogIHRsc19jb25maWc6CiAgICBpbnNlY3VyZV9za2lwX3ZlcmlmeTogdHJ1ZQogIHN0YXRpY19jb25maWdzOgogICAgLSB0YXJnZXRzOiBbImNpYW10ZXN0LnNtb2EuY2M6NDQzIl0K
kind: Secret
metadata:
  creationTimestamp: null
  name: additional-scrape-configs
Decode the base64 string to check the content:
$ echo "LSBqb2JfbmFtZTogZXh0ZXJuYWwtYXBwbGljYXRpb24tZXhwb3J0ZXItaHR0cHMKICBzY3JhcGVfaW50ZXJ2YWw6IDEwcwogIHNjcmFwZV90aW1lb3V0OiAxMHMKICBtZXRyaWNzX3BhdGg6IC9tZXRyaWNzCiAgc2NoZW1lOiBodHRwcwogIHRsc19jb25maWc6CiAgICBpbnNlY3VyZV9za2lwX3ZlcmlmeTogdHJ1ZQogIHN0YXRpY19jb25maWdzOgogICAgLSB0YXJnZXRzOiBbImNpYW10ZXN0LnNtb2EuY2M6NDQzIl0K" | base64 -d
which yields:
- job_name: external-application-exporter-https
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  tls_config:
    insecure_skip_verify: true
  static_configs:
    - targets: ["www.baidu.com:443"]
This confirms the configuration file was generated correctly; now create the Secret:
$ kubectl apply -f additional-scrape-configs.yaml -n monitoring
monitoring is the namespace where Prometheus is deployed; the Secret must live in the same namespace.
Confirm that the secret is generated:
$ kubectl get secret -n monitoring
Output:
Step 3: modify the Prometheus custom resource
Finally, reference this additional configuration from the Prometheus custom resource,
as described in the official documentation for customizing the Prometheus configuration.
First find the CRD of prometheus:
$ kubectl get prometheus -n monitoring
NAME VERSION REPLICAS AGE
prometheus-kube-prometheus-prometheus v2.38.0 1 2d18h
Then edit it:
$ kubectl edit prometheus prometheus-kube-prometheus-prometheus -n monitoring
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  labels:
    prometheus: prometheus
spec:
  ...
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml
  ...
Finally, check the effect in the Prometheus console:
the domain-name service is now being monitored. To add monitoring for more domain names in the future, only the Secret needs to be modified. Great!
Alerting
For alerting we use the prometheus + alertmanager combination. The overall flow from a monitored alert condition to a handled alert event is as follows:
Our business requirement is to be notified when a service goes down so it can be handled promptly. The alert rules therefore watch each application's liveness: when a non-live state is detected, the alert enters the pending state; when the pending duration passes a configured threshold, it moves to firing and triggers an alert. The alert is submitted to Alertmanager, which, according to its routing rules, delivers the message to receivers such as WeCom (Enterprise WeChat), DingTalk, email, and so on.
The specific steps are as follows:
Step 1: Prometheus alert rules
Reference: kube-prometheus-stack alarm configuration
Since I deployed kube-prometheus-stack with helm, to keep versions consistent the chart kube-prometheus-stack-40.0.0.tgz was downloaded locally in advance (helm pull prometheus-community/kube-prometheus-stack --version=40.0.0). After unpacking it, the relevant entries can be found in the chart's values.yaml:
PrometheusRules
## Deprecated way to provide custom recording or alerting rules to be deployed into the cluster.
##
# additionalPrometheusRules: []
#  - name: my-rule-file
#    groups:
#      - name: my_group
#        rules:
#          - record: my_record
#            expr: 100 * my_record

## Provide custom recording or alerting rules to be deployed into the cluster.
##
# additionalPrometheusRulesMap: {}
#  rule-name:
#    groups:
#      - name: my_group
#        rules:
#          - record: my_record
#            expr: 100 * my_record
Modify values.yaml:
## Deprecated way to provide custom recording or alerting rules to be deployed into the cluster.
##
# additionalPrometheusRules: []
#  - name: my-rule-file
#    groups:
#      - name: my_group
#        rules:
#          - record: my_record
#            expr: 100 * my_record

## Provide custom recording or alerting rules to be deployed into the cluster.
##
additionalPrometheusRulesMap:
  rule-name:
    groups:
      - name: Instance
        rules:
          # Alert for any instance that is unreachable for >5 minutes.
          - alert: InstanceDown
            expr: up == 0
            for: 5m
            labels:
              severity: page
            annotations:
              summary: "Instance {{ $labels.instance }} down"
              description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
Then upgrade the helm release:
helm upgrade -n monitoring prometheus --values=values.yaml ../kube-prometheus-stack-40.0.0.tgz
After the upgrade, check the result in the Prometheus console:
the alert rules are now in place. Per the rule, whenever any instance is down (up == 0), the alert state becomes pending; if it has not recovered after 5 minutes, the state changes to firing and an alert message is triggered.
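The pending/firing state of the rule can also be inspected directly in the Prometheus console through the built-in ALERTS metric, for example:

```
# All InstanceDown alerts and their current state (pending or firing)
ALERTS{alertname="InstanceDown"}

# Only the alerts that are already firing
ALERTS{alertname="InstanceDown", alertstate="firing"}
```

This is handy for verifying a rule fires as expected before wiring up Alertmanager notifications.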
Step 2: Alertmanager alert notification
After Prometheus fires an alert, it is sent to Alertmanager for unified handling. Alertmanager distributes alert messages to different receivers according to configured routing rules. In the kube-prometheus-stack values.yaml, the alertmanager.config section is provided so that specific Alertmanager settings, including receivers, can be customized. The original configuration is as follows:
## Configuration for alertmanager
## ref: https://prometheus.io/docs/alerting/alertmanager/
##
alertmanager:
  ...
  ## Alertmanager configuration directives
  ## ref: https://prometheus.io/docs/alerting/configuration/#configuration-file
  ##      https://prometheus.io/webtools/alerting/routing-tree-editor/
  ##
  config:
    global:
      resolve_timeout: 5m
    inhibit_rules:
      - source_matchers:
          - 'severity = critical'
        target_matchers:
          - 'severity =~ warning|info'
        equal:
          - 'namespace'
          - 'alertname'
      - source_matchers:
          - 'severity = warning'
        target_matchers:
          - 'severity = info'
        equal:
          - 'namespace'
          - 'alertname'
      - source_matchers:
          - 'alertname = InfoInhibitor'
        target_matchers:
          - 'severity = info'
        equal:
          - 'namespace'
    route:
      group_by: ['namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'null'
      routes:
        - receiver: 'null'
          matchers:
            - alertname =~ "InfoInhibitor|Watchdog"
    receivers:
      - name: 'null'
    templates:
      - '/etc/alertmanager/config/*.tmpl'
We modify it to:
## Configuration for alertmanager
## ref: https://prometheus.io/docs/alerting/alertmanager/
##
alertmanager:
  ...
  ## Alertmanager configuration directives
  ## ref: https://prometheus.io/docs/alerting/configuration/#configuration-file
  ##      https://prometheus.io/webtools/alerting/routing-tree-editor/
  ##
  config:
    global:
      resolve_timeout: 5m
    inhibit_rules:
      - source_matchers:
          - 'severity = critical'
        target_matchers:
          - 'severity =~ warning|info'
        equal:
          - 'namespace'
          - 'alertname'
      - source_matchers:
          - 'severity = warning'
        target_matchers:
          - 'severity = info'
        equal:
          - 'namespace'
          - 'alertname'
      - source_matchers:
          - 'alertname = InfoInhibitor'
        target_matchers:
          - 'severity = info'
        equal:
          - 'namespace'
    route:
      group_by: ['instance']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'wx-webhook'
      routes: []
    receivers:
      - name: 'wx-webhook'
        webhook_configs:
          - url: "http://wx-webhook:80/adapter/wx"
            send_resolved: true
    templates:
      - '/etc/alertmanager/config/*.tmpl'
The address webhook_configs[0].url: "http://wx-webhook:80/adapter/wx" is the webhook of the WeCom (Enterprise WeChat) group robot that receives the alert messages. Setting up this robot webhook is explained in detail next.
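For comparison, a receiver does not have to be a webhook. A hedged sketch of adding an email receiver alongside the webhook one (every SMTP detail below is a placeholder to replace with your own; a route entry pointing at 'mail' would then direct selected alerts to it):

```yaml
    receivers:
      - name: 'wx-webhook'
        webhook_configs:
          - url: "http://wx-webhook:80/adapter/wx"
            send_resolved: true
      - name: 'mail'                          # hypothetical extra receiver
        email_configs:
          - to: 'oncall@example.com'          # placeholder address
            from: 'alertmanager@example.com'
            smarthost: 'smtp.example.com:587' # placeholder SMTP server
            auth_username: 'alertmanager@example.com'
            auth_password: 'changeme'
            send_resolved: true
```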
Then upgrade the helm release:
helm upgrade -n monitoring prometheus --values=values.yaml ../kube-prometheus-stack-40.0.0.tgz
After the configuration is complete, stop a service and check the result in the WeCom group:
Step 3: Build the WeCom group robot webhook
Reference: prometheus alarms through enterprise WeChat robots
Create a WeCom group robot
In the group settings, open the group robot feature, add a group robot, and copy the new robot's webhook address.
Then write the deployment configuration file wx-webhook-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wx-webhook
  labels:
    app: wx-webhook
spec:
  replicas: 1
  selector:
    matchLabels:
      app: wx-webhook
  template:
    metadata:
      labels:
        app: wx-webhook
    spec:
      containers:
        - name: wx-webhook
          image: guyongquan/webhook-adapter:latest
          imagePullPolicy: IfNotPresent
          args: ["--adapter=/app/prometheusalert/wx.js=/wx=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxxxxxxxxxxxxxxxxxxx"]
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: wx-webhook
  labels:
    app: wx-webhook
spec:
  selector:
    app: wx-webhook
  ports:
    - name: wx-webhook
      port: 80
      protocol: TCP
      targetPort: 80
      nodePort: 30904
  type: NodePort
The key=... parameter inside args is the webhook address of the WeCom robot created in the previous step. Then run the commands:
$ kubectl apply -f wx-webhook-deployment.yaml -nmonitoring
$ kubectl get pod -n monitoring | grep wx-webhook
wx-webhook-78d4dc95fc-9nsjn 1/1 Running 0 26d
$ kubectl get service -n monitoring | grep wx-webhook
wx-webhook NodePort 10.106.111.183 <none> 80:30904/TCP 27d
This completes the setup of the WeCom group robot webhook.
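To verify the adapter end-to-end without waiting for a real alert, you can POST a hand-written Alertmanager-style webhook payload to the NodePort. A sketch (the node IP is a placeholder, and the payload below is a minimal, assumed subset of Alertmanager's webhook format):

```shell
# Minimal Alertmanager-style webhook body (assumed/abbreviated fields).
PAYLOAD='{"status":"firing","alerts":[{"status":"firing","labels":{"alertname":"ManualTest","instance":"demo"},"annotations":{"summary":"manual test alert"}}]}'
echo "$PAYLOAD"
# Send it to the adapter via the NodePort; uncomment with a real node IP:
# curl -s -X POST -H 'Content-Type: application/json' -d "$PAYLOAD" http://<node-ip>:30904/adapter/wx
```

If the robot is wired up correctly, a test message should appear in the WeCom group.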
Here WeCom (Enterprise WeChat) is used as the alert receiver; Alertmanager also supports many other receivers. See: Detailed explanation of kube-prometheus monitoring alarms (email, DingTalk, WeChat, WeCom robot, self-built platform).
Problems encountered
- After updating the scrape-config Secret, no effect is visible in the Prometheus console.
Trying to restart the pod prometheus-prometheus-kube-prometheus-prometheus-0 produces the error:
ts=2023-07-29T09:30:54.188Z caller=main.go:454 level=error msg="Error loading config (--config.file=/etc/prometheus/config_out/prometheus.env.yaml)" file=/etc/prometheus/config_out/prometheus.env.yaml err="parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: scrape timeout greater than scrape interval for scrape config with job name \"external-application-exporter-https\""
The cause is a mistake in the custom scrape configuration that prevents Prometheus from starting: scrape_timeout must not be greater than scrape_interval, but here it was.
- job_name: external-application-exporter-https
  scrape_interval: 10s
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: https
  tls_config:
    insecure_skip_verify: true
  static_configs:
    - targets: ["www.baidu.com:443"]
It needs to be changed to:
- job_name: external-application-exporter-https
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  tls_config:
    insecure_skip_verify: true
  static_configs:
    - targets: ["www.baidu.com:443"]
References
- Getting started with Grafana & prometheus
- Prometheus monitoring + Grafana + Alertmanager alarm installation and use (detailed picture and text explanation)
- Prometheus official tutorial
- Helm repository
- Github address of kube-prometheus project
- kratos official tutorial
- K8s official documentation
- Source code of prometheus-operator
- kube-prometheus-stack alarm configuration
- kube-prometheus-stack configure AlertManager
- prometheus alarms through enterprise WeChat robot
- Detailed explanation of kube-prometheus monitoring and alarming (email, DingTalk, WeChat, WeCom robot, self-built platform)