Cloud native: in-depth analysis of Prometheus installation, deployment, and principles

1. Introduction to Prometheus

① Prometheus features

  • Prometheus was originally an open source monitoring and alerting system developed at SoundCloud, and it can be seen as an open-source counterpart of Google's BorgMon monitoring system. In 2016, Prometheus joined the CNCF and became the second project hosted by the CNCF after Kubernetes. As Kubernetes established its leading position in container orchestration, Prometheus became the de facto standard for Kubernetes container monitoring.
  • The main features of Prometheus are as follows:
    • A multi-dimensional, time-series data model in which series are identified by metric name and labels (key/value pairs);
    • A flexible query language, PromQL;
    • A single server node is autonomous and does not rely on additional distributed storage;
    • Time series data is collected over HTTP using a pull model;
    • Push-mode collection is supported for applications that need it, via an intermediary gateway (Pushgateway);
    • Monitoring targets can be found via service discovery or static configuration;
    • A wide variety of chart and dashboard integrations are supported.

② Prometheus core components

  • The entire Prometheus ecosystem contains multiple components, all of which are optional except the Prometheus server;
    • Prometheus Server: the core component, used to collect and store time series data;
    • Client Library: generates the corresponding metrics inside the services to be monitored and exposes them to the Prometheus server; when the Prometheus server scrapes them, the current metric values are returned directly;
    • Push Gateway: mainly used for short-lived jobs. Since such jobs exist only briefly, they may disappear before Prometheus can pull them, so they push their metrics to the Push Gateway instead. This approach is mainly for service-level metrics; for machine-level metrics, node exporter should be used;
    • Exporters: used to expose the metrics of existing third-party services to Prometheus;
    • Alertmanager: after receiving alerts from the Prometheus server, it deduplicates and groups them, routes them to the configured receivers, and sends the notifications; common receivers include email, PagerDuty, OpsGenie, webhooks, etc.;
    • Various support tools.

③ Prometheus framework

  • Retrieval is responsible for scraping sampled metric data from the targets' exposed pages at regular intervals;
  • Storage is responsible for writing the sampled data to the specified time series database;
  • PromQL is the query language module provided by Prometheus, which can be integrated with web UIs such as Grafana;
  • Jobs / Exporters: Prometheus can pull monitoring data from Jobs or Exporters, where an Exporter exposes its data collection interface as a Web API;
  • Prometheus Server: Prometheus can also pull data from other Prometheus Servers;
  • Pushgateway: some components run as short-lived jobs, and Prometheus may not get a chance to pull their monitoring data before they exit; such jobs push their data to the Pushgateway at runtime, and Prometheus later pulls it from the Pushgateway, so no monitoring data is lost;
  • Service discovery: Prometheus can dynamically discover services and pull their data for monitoring, for example from DNS, Kubernetes, or Consul; file_sd covers statically configured target files;
  • AlertManager: an external component independent of Prometheus, used for alerting; alert rules can be configured through configuration files, and Prometheus pushes alerts to AlertManager.
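  • To make the relationships above concrete, the sketch below shows a minimal prometheus.yml that wires these pieces together: a directly scraped exporter, the Pushgateway, file-based service discovery (file_sd), rule files, and Alertmanager; all hostnames and ports are placeholder assumptions, not values from this deployment:
# Minimal prometheus.yml sketch (placeholder hosts/ports)
global:
  scrape_interval: 15s                       # Retrieval: how often targets are scraped
rule_files:
  - "alert.rules.yml"                        # rules evaluated against the stored samples
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager.example:9093"]
scrape_configs:
  - job_name: "node"                         # an exporter scraped directly
    static_configs:
      - targets: ["node1.example:9100"]
  - job_name: "pushgateway"                  # short-lived jobs push here; Prometheus pulls from it
    honor_labels: true
    static_configs:
      - targets: ["pushgateway.example:9091"]
  - job_name: "file-sd"                      # targets discovered from files
    file_sd_configs:
      - files: ["targets/*.json"]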


④ Workflow

  • The Prometheus server periodically pulls metrics from the configured jobs or exporters, pulls metrics pushed to the Pushgateway, or pulls metrics from other Prometheus servers;
  • The Prometheus server stores the collected metrics locally and runs the defined alert.rules over them to record new time series or push alerts to Alertmanager;
  • Alertmanager processes the received alerts and sends notifications according to its configuration file;
  • The collected data is visualized in a graphical interface.
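  • As an illustration of the alert.rules step, a rules file loaded by the server might look like the sketch below; the metric names and thresholds are illustrative assumptions, not part of this deployment:
# Sketch of a rules file (e.g. alert.rules.yml); names and thresholds are illustrative
groups:
  - name: example
    rules:
      - record: job:http_requests:rate5m               # a new time series recorded by the server
        expr: sum(rate(http_requests_total[5m])) by (job)
      - alert: InstanceDown                            # pushed to Alertmanager when firing
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"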

2. Deployment of Prometheus

① Install git

$ yum install -y git
  • Download kube-prometheus:
# Download
$ git clone https://github.com/prometheus-operator/kube-prometheus.git
$ cd kube-prometheus/manifests
# There are many yaml files, so organize them into directories
$ mkdir -p serviceMonitor prometheus adapter node-exporter blackbox kube-state-metrics grafana alertmanager operator other/{nfs-storage,ingress}
$ mv alertmanager-* alertmanager/ && mv blackbox-exporter-* blackbox/ && mv grafana-* grafana/ && mv kube-state-metrics-* kube-state-metrics/ && mv node-exporter-* node-exporter/ && mv prometheus-adapter-* adapter/ && mv prometheus-* prometheus/ && mv kubernetes-serviceMonitor* serviceMonitor/

② Modify the mirror source

  • Some images hosted on foreign registries cannot be pulled, so change the image sources of prometheus-operator, prometheus, alertmanager, kube-state-metrics, node-exporter, and prometheus-adapter to domestic mirrors:
# Search
$ grep -rn 'quay.io' *
# Batch replace
$ sed -i 's/quay.io/quay.mirrors.ustc.edu.cn/g' `grep "quay.io" -rl *`
# Search again
$ grep -rn 'quay.io' *
$ grep -rn 'image: ' *


③ Change the type to NodePort

  • In order to access prometheus, alertmanager, and grafana from outside the cluster, change the service type of prometheus, alertmanager, and grafana to NodePort;
  • Modify the service of prometheus:
# Set external access port: 30080
$ cat prometheus-service.yaml
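  • Roughly, the modified Service looks like the sketch below (only the 30080 nodePort comes from this article; the name, labels, and selector are assumptions modelled on the default kube-prometheus manifest); grafana-service.yaml and alertmanager-service.yaml are changed the same way with nodePort 30081 and 30082:
# Sketch of prometheus-service.yaml with type NodePort (labels/selector are assumptions)
apiVersion: v1
kind: Service
metadata:
  name: prometheus-k8s
  namespace: monitoring
spec:
  type: NodePort              # changed from the default ClusterIP
  ports:
    - name: web
      port: 9090
      targetPort: web
      nodePort: 30080         # external access port
  selector:
    app: prometheus
    prometheus: k8s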


  • Modify grafana's service:
# Set external access port: 30081
$ cat grafana-service.yaml


  • Modify the service of alertmanager:
# Set external access port: 30082
$ cat alertmanager-service.yaml


  • Install CRD and prometheus-operator:
$ kubectl apply -f setup/
  • It takes a few minutes to download the prometheus-operator image; wait until prometheus-operator is in the Running state:
$ kubectl get pod -n monitoring


  • Install prometheus, alertmanager, grafana, kube-state-metrics, node-exporter and other resources:
$ kubectl apply -f .
  • Wait a while and check the pod status in the monitoring namespace again; once all pods in the namespace are Running, the deployment is complete:
$ kubectl get pod -n monitoring


  • Workflow: the Prometheus Server periodically pulls metrics from the configured Exporters/Jobs or from the Pushgateway, runs the defined alert.rules over the collected data, records new time series, or pushes alerts to Alertmanager.
  • Component Description:
    • node_exporter: monitors host resource information on the compute nodes; it needs to be deployed on all compute nodes;
    • kube-state-metrics: the exporter Prometheus uses to collect k8s resource data; it covers most built-in k8s resources, such as pods, deployments, and services, and it also exposes its own metrics, mainly statistics on resource counts and on the number of exceptions encountered;
    • blackbox_exporter: monitors the liveness of business containers;
    • prometheus-adapter: since Prometheus is a third-party solution, native k8s cannot consume Prometheus custom metrics directly, so k8s-prometheus-adapter is needed to expose these metric query interfaces through the standard Kubernetes custom metrics API;
  • verify:
$ kubectl get svc -n monitoring


  • Browser access:
    • prometheus:http://192.168.0.113:30080/:


    • grafana: http://192.168.0.113:30081/login, default account/password: admin/admin:


    • alertmanager:http://192.168.0.113:30082/:


  • In Grafana, add a data source and set the prometheus address:



    • kubernetes template import (k8s template):


3. Prometheus related concepts

① Internal storage mechanism

  • Prometheus stores time series data very efficiently: each sample takes only about 3.5 bytes. Millions of time series at a 30-second scrape interval with 60 days of retention take a little over 200 GB.
  • Prometheus is mainly divided into three major blocks:
    • Retrieval is responsible for capturing sampling indicator data on the exposed target page at regular intervals;
    • Storage is responsible for writing sampled data to disk;
    • PromQL is a query language module provided by Prometheus.


② Data model

  • All data stored by Prometheus is time series data (Time Series Data). A time series is a stream of timestamped values belonging to a metric (Metric) and a set of labels (Label) under that metric:


  • Each metric name represents a type of metric, and it can carry different labels; each combination of metric name + labels identifies one time series. In Prometheus all values are 64-bit; what each time series actually records is a 64-bit timestamp plus a 64-bit sample value:
    • Metric name: the name should be semantic and generally indicates what the metric measures, for example http_requests_total, the total number of HTTP requests. Metric names consist of ASCII letters, digits, underscores, and colons, and must match the regular expression [a-zA-Z_:][a-zA-Z0-9_:]*;
    • Labels: give the same metric different dimensions, for example http_requests_total{method="GET"} means the GET requests among all HTTP requests; with method="POST" it is a different time series. Label names consist of ASCII letters, digits, and underscores, and must match the regular expression [a-zA-Z_][a-zA-Z0-9_]*;
    • timestamp: the time of the data point, indicating when the sample was recorded;
    • Sample value: the actual value of the series; each sample consists of a float64 value and a millisecond-precision timestamp.
http_requests_total{status="200",method="GET"}
http_requests_total{status="404",method="GET"}
  • Based on the analysis above, time series storage looks like it could be designed as key-value storage (as in BigTable):


  • Further splitting, it can look like this:


  • The second style in the figure above is Prometheus's current internal representation; the metric name itself is stored as a specific label (__name__).
  • Let's review the overall process of Prometheus:


  • As for the KV storage mentioned above, it uses the LevelDB engine, which is characterized by very high sequential read/write performance, a good fit for time series storage.
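  • Conceptually, each stored series can be sketched as a mapping from a label set (with the metric name held in the __name__ label) to an append-only list of (timestamp, value) pairs:
{__name__="http_requests_total", status="200", method="GET"} -> [(t0, v0), (t1, v1), (t2, v2), ...]
{__name__="http_requests_total", status="404", method="GET"} -> [(t0, v0), (t1, v1), ...]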

③ Metric types

  • Prometheus defines 4 different metric types: Counter (counter), Gauge (gauge), Histogram (histogram), and Summary (summary).
  • Counter:
    • A cumulative metric; typical applications include the number of requests, completed tasks, errors, and so on;
    • For example: querying http_requests_total{method="get", job="Prometheus", handler="query"} returns 8; querying again 10 seconds later returns 14.
  • Gauge (dashboard):
    • The value is an instantaneous value, such as current memory usage; it changes over time;
    • For example: go_goroutines{instance="172.17.0.2", job="Prometheus"} returns 147, and returns 124 after 10 seconds.
  • Histogram (histogram):
    • Histogram samples observations (usually request durations or response sizes) and counts them in configurable buckets; it also provides the sum of all observed values;
    • A histogram with a base metric name exposes multiple time series in one scrape: cumulative counters for the observation buckets (_bucket{le=""}), the total sum of all observed values (_sum), and the count of observed events (_count).
    • For example: prometheus_local_storage_series_chunks_persisted in the Prometheus server indicates the number of chunks persisted per time series, which can be used to calculate quantiles of the data to be persisted;
  • Summary:
    • Similar to Histogram, Summary samples observations (usually request durations or response sizes), but it also provides the count of observations and the sum of all values, and it calculates configurable quantiles over a sliding time window;
    • A summary with a base metric name exposes multiple time series in one scrape: streaming φ-quantiles (0 ≤ φ ≤ 1) of the observed events ({quantile="φ"}), the total sum of all observed values (_sum), and the count of observed events (_count);
    • For example: prometheus_target_interval_length_seconds in the Prometheus server.

④ Comparison between Histogram and Summary

Aspect                 Histogram                                        Summary
Configuration          Bucket intervals                                 Quantiles and sliding window
Client performance     Only counters are incremented, low cost          Streaming quantile computation, high cost
Server performance     Computing quantiles is expensive, can be slow    No computation needed, low cost
Time series produced   _sum, _count, _bucket                            _sum, _count, quantile
Quantile error         Related to the bucket sizes                      Related to the configuration of φ
φ and sliding window   Set in Prometheus expressions                    Set by the client
Aggregation            Aggregatable via expressions                     Generally not aggregatable
  • The following is an example of sample output for types histogram and summary:
# A histogram, which has a pretty complex representation in the text format:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320
# Finally a summary, which has a complex representation, too:
# HELP rpc_duration_seconds A summary of the RPC duration in seconds.
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.01"} 3102
rpc_duration_seconds{quantile="0.05"} 3272
rpc_duration_seconds{quantile="0.5"} 4773
rpc_duration_seconds_sum 1.7560473e+07
rpc_duration_seconds_count 2693
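  • At query time, a histogram's _bucket counters can be turned into an estimated quantile, and _sum/_count into an average; the PromQL sketches below use the metric from the sample output above:
# Estimated 95th percentile request duration over the last 5 minutes
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Average request duration over the same window
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])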

⑤ Task (JOBS) and instance (INSTANCES)

  • In Prometheus terminology, an endpoint that can be scraped is called an instance, which usually corresponds to a single process;
  • A collection of instances with the same purpose (for example, a process replicated for scalability or reliability) is called a job.
  • An API server job with four replicated instances:
    • job: api-server:
      • instance 1: 1.2.3.4:5670
      • instance 2: 1.2.3.4:5671
      • instance 3: 5.6.7.8:5670
      • instance 4: 5.6.7.8:5671
    • instance: a single scrape target, generally corresponding to a process;
    • jobs: A group of instances of the same type (mainly used to ensure scalability and reliability).
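  • In a scrape configuration, this grouping maps directly onto one job with several static targets (a sketch using the addresses above):
scrape_configs:
  - job_name: "api-server"
    static_configs:
      - targets:
          - "1.2.3.4:5670"
          - "1.2.3.4:5671"
          - "5.6.7.8:5670"
          - "5.6.7.8:5671"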

⑥ Node exporter

  • Node exporter is mainly used to expose machine-level metrics to Prometheus, including CPU load, memory usage, network, and so on.

⑦ Pushgateway

  • Pushgateway is an important tool in the Prometheus ecosystem. The main reasons for using it are:
    • Prometheus uses a pull model, so it may be unable to pull data from some targets directly because they sit in a different subnet or behind a firewall;
    • When monitoring business data, metrics from different sources need to be summarized so that Prometheus can collect them in one place;
  • For the reasons above, the Pushgateway has to be used, but before using it, it is worth understanding some of its drawbacks:
    • Data from multiple nodes is aggregated into the Pushgateway; if the Pushgateway goes down, the impact is larger than losing individual targets;
    • Prometheus's up status reflects only the Pushgateway itself, not each node behind it;
    • The Pushgateway persists all monitoring data pushed to it, so even when a source goes offline Prometheus keeps scraping the old data, and stale data on the Pushgateway has to be cleaned up manually.
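  • Despite these drawbacks, pushing to the gateway is simple; the sketch below pushes one sample from a batch job and then deletes the group when it is no longer wanted (the Pushgateway address and the job/instance names are placeholder assumptions):
# Push one sample for job "backup_job", instance "db1" (address is a placeholder)
echo "backup_last_success_timestamp_seconds $(date +%s)" | \
  curl --data-binary @- http://pushgateway.example:9091/metrics/job/backup_job/instance/db1

# Old data is not expired automatically; delete the group manually when it is no longer wanted
curl -X DELETE http://pushgateway.example:9091/metrics/job/backup_job/instance/db1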

4. Introduction to TSDB

① Characteristics of time series database

  • A TSDB (Time Series Database) can be simply understood as software optimized for handling time series data, in which data is indexed by time.
  • TSDB features are:
    • The vast majority of operations are writes;
    • Writes are almost always sequential appends; most of the time, data arrives sorted by time;
    • Writes rarely touch old data and rarely update existing data; in most cases, data is written to the database a few seconds or minutes after it is collected;
    • Deletes are generally done in blocks: a starting historical time is chosen and subsequent blocks are specified; data is rarely deleted at an individual or random point in time;
    • The data volume is large, generally exceeding memory size, and queries usually touch only a small, irregular part of it, so caching is of little help;
    • Reads are typically ascending or descending sequential reads;
    • Highly concurrent reads are very common.

② Common time series database

TSDB project   Official website
influxDB       https://influxdata.com/
RRDtool        http://oss.oetiker.ch/rrdtool/
Graphite       http://graphiteapp.org/
OpenTSDB       http://opentsdb.net/
kdb+           http://kx.com/
Druid          http://druid.io/
KairosDB       http://kairosdb.github.io/
Prometheus     https://prometheus.io/

5. PromQL query expressions

  • The four data types of PromQL:
    • Instant vector: a set of time series, each containing a single sample, all sharing the same timestamp;
    • Range vector: a set of time series, each containing a range of data points over time;
    • Scalar: a simple numeric floating-point value;
    • String: a simple string value, currently unused.

① Instant vector selector

  • The instant vector selector selects a set of time series, with a single sample for each at a given timestamp; the simplest form below selects all time series with the metric name http_requests_total:
http_requests_total
  • These time series can be further filtered by appending a set of label matchers enclosed in {}; the following selects only the time series with the name http_requests_total whose job label is prometheus and whose group label is canary:
http_requests_total{job="prometheus",group="canary"}
  • In addition, label values can also be matched negatively, or matched against a regular expression. The matching operators are listed below:
    • =: select labels that are exactly equal to the provided string;
    • !=: select labels that are not equal to the provided string;
    • =~: select labels that regex-match the provided string;
    • !~: select labels that do not regex-match the provided string;
  • Select the http_requests_total time series for HTTP methods other than GET in the staging, testing, and development environments:
http_requests_total{environment=~"staging|testing|development",method!="GET"}

② Range vector selector

  • Range vector expressions work just like instant vector expressions, but they return a set of time series containing the data points over a time range ending at the current moment. Syntactically, a time range is appended in [] after a vector expression; the duration is a number followed by one of the following units:
    • s:seconds
    • m:minutes
    • h:hours
    • d:days
    • w:weeks
    • y:years
  • The example below selects all values recorded over the last 5 minutes for time series whose metric name is http_requests_total and whose job label is prometheus:
http_requests_total{job="prometheus"}[5m]

③ Offset Modifier

  • The offset modifier shifts the time of individual instant and range vectors in a query. For example, the following expression returns the value of http_requests_total 5 minutes before the current query evaluation time:
http_requests_total offset 5m
  • The same applies to range vectors; this returns the 5-minute rate of http_requests_total from one week ago:
rate(http_requests_total[5m] offset 1w)

④ Use aggregation operations

  • The aggregation operators provided by PromQL can be used to process these time series into new time series:
# Total number of HTTP requests in the system
sum(http_request_total)

# Average CPU usage time of the hosts, grouped by mode
avg(node_cpu) by (mode)

# CPU usage of each host
sum(sum(irate(node_cpu{mode!='idle'}[5m])) / sum(irate(node_cpu[5m]))) by (instance)
  • Common aggregate functions:
    • sum
    • min (minimum value)
    • max (maximum value)
    • avg (average value)
    • stddev (standard deviation)
    • stdvar (standard variance)
    • count (count)
    • count_values (count of series per distinct sample value)
    • bottomk (bottom n time series)
    • topk (top n time series)
    • quantile
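  • A few more of these operators in use, as sketches against the metrics already shown above:
# Top 5 instances by HTTP request rate
topk(5, sum(rate(http_request_total[5m])) by (instance))

# 90th percentile of per-instance idle CPU rate across all instances
quantile(0.9, avg(irate(node_cpu{mode='idle'}[5m])) by (instance))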

6. Exporter

  • The Exporter is an important part of Prometheus monitoring and is responsible for collecting metric data. Broadly speaking, any program that can provide monitoring sample data to Prometheus can be called an Exporter, and an instance of an Exporter is called a target.


  • Official exporters include blackbox_exporter, node_exporter, mysqld_exporter, snmp_exporter, etc.; third-party exporters include redis_exporter, cadvisor, and others; a list of exporters is maintained in the official Prometheus documentation.

① Common Exporter

  • blackbox_exporter:
    • blackbox_exporter is the black-box probing solution provided by the Prometheus community; it lets users probe endpoints over HTTP, HTTPS, DNS, TCP, and ICMP and collect site information through the blackbox.
  • node_exporter:
    • node_exporter is mainly used to collect machine performance metrics, including basic information such as CPU, memory, disk, and I/O.
  • mysqld_exporter:
    • mysqld_exporter collects metrics for MySQL or MariaDB databases; it needs to connect to the database with the relevant permissions.
  • snmp_exporter:
    • snmp_exporter collects information from SNMP-enabled devices and exposes it to the Prometheus monitoring system.

② How the Exporter works

  • Running independently: since the operating system itself does not natively support Prometheus, and users cannot add Prometheus support directly at the OS level, a separate program is run that uses the interfaces the operating system provides to convert the system's runtime state into monitoring data Prometheus can read. Besides Node Exporter, MySQL Exporter, Redis Exporter, and others are implemented this way; these Exporter programs act as intermediate agents that convert the data.
  • Integrated into the application (recommended): to better monitor internal state, some open source projects such as Kubernetes and etcd use the Prometheus Client Library directly in their code, providing native support for Prometheus. This lets the application expose its internal running state to Prometheus directly, and it suits projects that need more custom monitoring metrics.

③ Exporter specification

  • All Exporter programs must return monitoring sample data in the Prometheus exposition format. For example, accessing a /metrics endpoint (here the Kubernetes API server's) returns content like the following (a plain curl does not get the data; the request must be authorized):
# take the first 10 lines
$ curl -s -k --header "Authorization: Bearer $TOKEN" https://192.168.0.113:6443/metrics|head -10


# HELP aggregator_openapi_v2_regeneration_count [ALPHA] Counter of OpenAPI v2 spec regeneration count broken down by causing APIService name and reason.
# TYPE aggregator_openapi_v2_regeneration_count counter
aggregator_openapi_v2_regeneration_count{apiservice="*",reason="startup"} 0
aggregator_openapi_v2_regeneration_count{apiservice="k8s_internal_local_delegation_chain_0000000002",reason="update"} 0
aggregator_openapi_v2_regeneration_count{apiservice="v1beta1.metrics.k8s.io",reason="add"} 0
aggregator_openapi_v2_regeneration_count{apiservice="v1beta1.metrics.k8s.io",reason="update"} 0
# HELP aggregator_openapi_v2_regeneration_duration [ALPHA] Gauge of OpenAPI v2 spec regeneration duration in seconds.
# TYPE aggregator_openapi_v2_regeneration_duration gauge
aggregator_openapi_v2_regeneration_duration{reason="add"} 0.929158077
aggregator_openapi_v2_regeneration_duration{reason="startup"} 0.509336209
  • The sample data returned by an Exporter consists of three parts: general annotation information for the sample (HELP), type annotation information for the sample (TYPE), and the samples themselves.
  • Prometheus parses the content of the Exporter response line by line:
    • If the current line starts with # HELP, Prometheus parses it with the following rule to obtain the metric name and its description:
# HELP <metrics_name> <doc_string>
    • If the current line starts with # TYPE, Prometheus parses it with the following rule to obtain the metric name and metric type:
# TYPE <metrics_name> <metrics_type>
    • The TYPE comment line must appear before the first sample of a metric; if no explicit type is given, the metric is treated as untyped. All lines that do not start with # are treated as monitoring samples, and each sample line must follow this format:
metric_name [
  "{" label_name "=" '"' label_value '"' { "," label_name "=" '"' label_value '"' } [ "," ] "}"
] value [ timestamp ]
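  • Put together, a metric that satisfies this specification looks like the illustrative sample below (not taken from a real exporter; the trailing timestamp in milliseconds is optional):
# HELP demo_requests_total Total number of demo requests handled.
# TYPE demo_requests_total counter
demo_requests_total{method="GET",code="200"} 1027 1690000000000
demo_requests_total{method="POST",code="200"} 3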

④ node-exporter

  • An Exporter is Prometheus's metric collection component: it collects data from the target Jobs and converts it into the time series data format Prometheus supports. Unlike traditional collection agents, it is only responsible for collection; it does not push data to a server, but waits for the Prometheus Server to scrape it actively.
  • node-exporter collects node runtime metrics, including basic monitoring metrics such as the node's cpu, load, filesystem, meminfo, and network, similar to the zabbix-agent of the Zabbix monitoring system; the schematic is as follows:


  • Check the node-exporter service:
$ kubectl get pods -n monitoring -o wide|grep node-exporter
# Check the node_exporter process inside the pod
$ kubectl exec -it node-exporter-dc65j -n monitoring -- ps -ef|grep node_exporter
# Get the container ID
$ docker ps |grep node_exporter
# Get the pid of the docker container
$ docker inspect -f '{{.State.Pid}}' 8b3f0c3ea055
# Enter the network namespace via the pid
$ nsenter -n -t 8303
# Check the process again
$ ps -ef|grep node_exporter
# Exit the current namespace
$ exit


  • The yaml files involved:
node-exporter-clusterRoleBinding.yaml # role binding
node-exporter-clusterRole.yaml # role
node-exporter-daemonset.yaml # daemonset: container config, node-exporter config
node-exporter-prometheusRule.yaml # collection/recording rules
node-exporter-serviceAccount.yaml # service account
# Prometheus inside the k8s cluster scrapes monitoring data through the servicemonitor CRD.
# Each servicemonitor corresponds to one target in Prometheus.
# Each servicemonitor corresponds to one or more services; it obtains the monitoring data exposed on the specified ports of those services and reports it to Prometheus.
node-exporter-serviceMonitor.yaml
node-exporter-service.yaml # service


  • Service auto-discovery:
    • Any monitored target must be brought into the monitoring system before time series collection, storage, alerting, and display can happen. Targets can be specified statically via configuration, or managed dynamically by Prometheus through the service discovery mechanism.
    • Let's take a look at the traditional configuration method first:
      • First, install node-exporter to collect node metrics and expose a port;
      • Then add a node-exporter job under scrape_configs in the Prometheus Server's prometheus.yaml, configuring the node-exporter address and port;
      • Then restart the Prometheus service;
      • Finally, wait for Prometheus to scrape the metrics, which completes the task of adding one node-exporter monitoring target.
    • The sample configuration is as follows (prometheus.yml):
- job_name: 'node-exporter'
    static_configs:
    - targets: ['192.168.0.113:9090']   # the port was changed to 9090 here
    • Restart the service:
$ systemctl restart prometheus
    • kube-prometheus service automatic discovery:
      • First, as with the traditional method, deploy a node-exporter to obtain the monitoring metrics;
      • Then write a ServiceMonitor that selects the node-exporter just deployed through a labelSelector; because the Operator deploys Prometheus with a default serviceMonitorSelector label of prometheus: kube-prometheus, it is enough to put the label prometheus: kube-prometheus on the ServiceMonitor for Prometheus to select it;
      • After these two steps, monitoring of the host's resources is complete: there is no need to change the Prometheus configuration file or restart the Prometheus service, which is very convenient. The Operator watches for ServiceMonitor changes, dynamically regenerates the Prometheus configuration file, and ensures the new configuration takes effect immediately; a sketch of such a ServiceMonitor follows below.
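      • Roughly, such a ServiceMonitor looks like the sketch below (labels, port names, and TLS settings are assumptions modelled on the kube-prometheus node-exporter manifest, not copied from it):
# Sketch of a ServiceMonitor selecting the node-exporter Service (field values are assumptions)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    prometheus: kube-prometheus               # label matched by Prometheus's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter   # selects the node-exporter Service
  endpoints:
    - port: https
      interval: 15s
      scheme: https
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      tlsConfig:
        insecureSkipVerify: true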

7. Add k8s external monitoring

① Configuration process

  • At the beginning of a project it may be difficult to containerize everything, for example databases and CDH clusters, but they still need to be monitored. Splitting them into a second, separate Prometheus would be hard to manage, so these monitoring targets are added to kube-prometheus as well.
  • For details on the additionalScrapeConfigs property, use the kubectl explain command:
$ kubectl explain prometheus.spec.additionalScrapeConfigs


  • Next, create a new prometheus-additional.yaml file and add scrape_configs entries for the additional monitoring components:
$ cat << EOF > prometheus-additional.yaml
- job_name: 'node-exporter-others'
  static_configs:
    - targets:
      - *.*.*.113:31190
      - *.*.*.114:31190
      - *.*.*.115:31190

- job_name: 'mysql-exporter'
  static_configs:
    - targets:
      - *.*.*.104:9592
      - *.*.*.125:9592
      - *.*.*.128:9592

- job_name: 'nacos-exporter'
  metrics_path: '/nacos/actuator/prometheus'
  static_configs:
    - targets:
      - *.*.*.113:8848
      - *.*.*.114:8848
      - *.*.*.115:8848

- job_name: 'elasticsearch-exporter'
  static_configs:
  - targets:
    - *.*.*.113:9597
    - *.*.*.114:9597
    - *.*.*.115:9597

- job_name: 'zookeeper-exporter'
  static_configs:
  - targets:
    - *.*.*.113:9595
    - *.*.*.114:9595
    - *.*.*.115:9595

- job_name: 'nginx-exporter'
  static_configs:
  - targets:
    - *.*.*.113:9593
    - *.*.*.114:9593
    - *.*.*.115:9593

- job_name: 'redis-exporter'
  static_configs:
  - targets:
    - *.*.*.113:9594

- job_name: 'redis-exporter-targets'
  static_configs:
    - targets:
      - redis://*.*.*.113:7090
      - redis://*.*.*.114:7090
      - redis://*.*.*.115:7091
  metrics_path: /scrape
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: *.*.*.113:9594
EOF
  • Then these monitoring configurations need to be stored in the k8s cluster as a secret resource type:
$ kubectl create secret generic additional-scrape-configs --from-file=prometheus-additional.yaml -n monitoring

② Modify the prometheus file

  • additionalScrapeConfigs: Add additional monitoring item configuration:
$ vi prometheus-prometheus.yaml
  • Add the following:
additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml
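  • For context, this fragment sits under the spec of the Prometheus custom resource; in prometheus-prometheus.yaml it lands roughly as in the sketch below (surrounding fields abbreviated; the resource name is an assumption based on the default kube-prometheus manifests):
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  # ... other fields unchanged ...
  additionalScrapeConfigs:
    name: additional-scrape-configs         # the Secret created above
    key: prometheus-additional.yaml         # the key inside the Secret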
  • Check:
$ grep -n -C5 'additionalScrapeConfigs' prometheus-prometheus.yaml


  • Apply the configuration so that it takes effect:
$ kubectl apply -f prometheus-prometheus.yaml
