How to use Prometheus to monitor a Kubernetes cluster of 100,000 containers

Overview

Not long ago, in the article "How to extend a single Prometheus to monitor nearly 10,000 Kubernetes clusters?", we described in detail the evolution of Kvass, the TKE team's large-scale Kubernetes monitoring system, and how we modified the Prometheus code to achieve horizontal scaling for larger clusters. After further improvement, Kvass now implements Prometheus clustering through a sidecar instead of modifying the Prometheus code. Since the solution has value for the community, the team decided to open source the project and share it. Project address: https://github.com/tkestack/kvass

This article first describes the single-machine performance bottlenecks of Prometheus and the existing clustering solutions in the community, and then introduces in detail the design ideas, usage examples, and stress test results of the open-source version of Kvass.

In addition, the Tencent Cloud Container team has further refined the design ideas of Kvass and built a high-performance cloud-native monitoring service that supports multiple clusters. The product is now open for trial, and readers are welcome to try it: https://console.cloud.tencent.com/tke2/prometheus

In the following sections, we will also use this product directly to demonstrate Kvass's ability to monitor large-scale clusters.

Prometheus

Relying on its strong single-machine performance, flexible PromQL, and active community ecosystem, Prometheus has gradually become the core monitoring component of the cloud-native era and is used by major companies around the world to monitor their core business.

However, for large-scale monitoring targets (tens of millions of series), native Prometheus is single-machine only and provides no clustering capability, so developers have to keep upgrading the machine to keep up with Prometheus's growing memory usage.

Single machine performance bottleneck

We stress-tested a single Prometheus instance to determine the reasonable load of a single Prometheus shard. The stress test had two goals.

  • Determine the relationship between the number of targets and Prometheus load
  • Determine the relationship between the number of series and Prometheus load

Target correlation

We keep the total number of series fixed at 1 million and observe how the Prometheus load changes as the number of targets varies.
Stress test results

Number of targets    CPU (cores)    Memory (GB)
100                  0.17           4.6
500                  0.19           4.2
1000                 0.16           3.9
5000                 0.3            4.6
  • From the table we can see that the number of targets has only a weak effect on Prometheus load: when the number of targets grows 50-fold, CPU consumption rises only slightly and memory is almost unchanged.

Series correlation

We keep the number of targets unchanged and observe how the Prometheus load changes as the total number of series varies.


Stress test results

Number of series (millions)    CPU (cores)    Memory (GB)    Time to query 15m of data for 1000 series (s)
1                              0.191          3.15           0.2
3                              0.939          20.14          1.6
5                              2.026          30.57          1.5
  • From the table, the Prometheus load is strongly affected by the number of series: the more series, the greater the resource consumption.
  • When the number of series exceeds 3 million, the memory growth of Prometheus becomes much more pronounced and a machine with more memory is required.

During the stress test we used a tool to generate the desired number of series. Every label name and value produced by the tool is short, fixed at about 10 characters, since our goal was only to observe relative load changes. In real production, label lengths vary and service discovery adds its own overhead, so the load for the same number of series will be considerably higher than in the stress test.

Existing clustering solution

To address the performance bottleneck of single-machine Prometheus for large-scale monitoring, the community currently offers several sharding solutions, mainly the following.

hash_mod

Prometheus officially supports sharding via the relabel mechanism: the scraped data is hashed in the configuration file, each Prometheus instance specifies a different modulus ID in its configuration file to keep only its own shard, and the shards' data is then aggregated through federation, Thanos, and so on, as shown in the figure below. Readers can also refer directly to the official documentation.
(Figure: hash_mod sharding, with shard data aggregated through federation or Thanos)
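
As an illustration only (the job name, modulus, and shard index below are assumptions, not taken from the official documentation), a hash_mod sharding configuration typically looks like the following; the instance for shard 1 would keep regex "1", and so on.

scrape_configs:
- job_name: example            # illustrative job name
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # hash a label that every target carries (here the scrape address) into N buckets
  - source_labels: [__address__]
    modulus: 3                 # assumed total number of Prometheus shards
    target_label: __tmp_hash
    action: hashmod
  # this instance keeps only the targets whose bucket is 0
  - source_labels: [__tmp_hash]
    regex: "0"
    action: keep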

Configuration file splitting

Another method is to split the scrape configuration at the job level according to the business: different Prometheus instances use completely independent scrape configurations that contain different jobs.
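
For example, a minimal sketch of such a split, with made-up job groupings (instance A collects only node-level jobs, instance B only application jobs):

# prometheus-a.yaml: node-level jobs only (illustrative)
scrape_configs:
- job_name: kubelet
  kubernetes_sd_configs:
  - role: node
- job_name: node-exporter
  kubernetes_sd_configs:
  - role: node

# prometheus-b.yaml: application jobs only (illustrative)
scrape_configs:
- job_name: app-metrics
  kubernetes_sd_configs:
  - role: pod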

Problems with the above schemes

Whether it is hash_mod or configuration file splitting, the essence is the same: the data is divided into multiple scrape configurations that are collected by different Prometheus instances. Both approaches have the following shortcomings.

  • Users must understand the monitored data in advance: the premise of the above methods is that the user knows what the monitored targets will report. For example, to use hash_mod you must know which label the targets expose to hash on, and to split by job you must know the rough size of each job.
  • Unbalanced instance load: although the above schemes are intended to spread data across different Prometheus instances, hashing on a few label values or simply splitting by job does not guarantee that the number of series finally collected by each instance is balanced, so individual instances still risk excessive memory usage.
  • Intrusive configuration: the user has to modify the original configuration file, add relabel-related configuration, or split one configuration file into several copies. Since the configuration is no longer a single file, it becomes harder to maintain.
  • No dynamic scaling: the configuration in the above schemes is hand-tailored to the actual scale of the monitored data, and there is no unified scaling mechanism that adds Prometheus instances as the data grows. Users can of course write scaling tools for their own business, but such tooling cannot be reused across different businesses.
  • Some APIs no longer work correctly: the above schemes scatter the data across different instances and then aggregate it through federation or Thanos to obtain global monitoring data. However, without additional processing, some native Prometheus APIs no longer return correct values; the most typical is /api/v1/targets, whose global result cannot be obtained under these schemes.

The principle of Kvass

Design goals

To address the above problems, we want to design a non-intrusive clustering solution that presents users with a single virtual Prometheus: its configuration file is identical to a native Prometheus configuration, its API is compatible, and it can scale. Specifically, we have the following design goals.

  • Non-intrusive, single configuration file: what users see and modify is a single native configuration file, with no special configuration added.
  • No need to know the monitored targets in advance: users no longer need to understand the collection targets beforehand or take part in the sharding process.
  • Balanced instance load: collection tasks should be divided according to the actual load of the monitoring targets so that instances are as balanced as possible.
  • Dynamic scaling: the system should scale out and in dynamically as the scale of the collection targets changes, with no breaks or gaps in the data.
  • Compatibility with the core Prometheus APIs: the more important APIs, such as the /api/v1/targets interface mentioned above, should work normally.

Architecture

Kvass consists of multiple components; the figure below shows the Kvass architecture. We use Thanos in the diagram, but Kvass does not strongly depend on Thanos and it can be replaced with other TSDBs.
(Figure: Kvass architecture)

  • Kvass sidecar: receives the collection tasks issued by the Coordinator, generates a new configuration file for Prometheus, and maintains the load information of its targets.
  • Kvass coordinator: the central controller of the cluster, responsible for service discovery, load detection, target assignment, and so on.
  • Thanos components: only the Thanos sidecar and Thanos Query are shown in the figure; they aggregate the shards' data to provide a unified data view.

Coordinator

The Kvass coordinator first takes over service discovery from Prometheus and obtains, in real time, the list of targets to be collected.

For these targets, the Kvass coordinator is responsible for load detection, that is, evaluating the number of series of each target. Once a target's load has been detected, the coordinator assigns the target, in the next computation cycle, to a shard whose load is below the threshold.

The Kvass coordinator is also responsible for scaling the sharded cluster out and in.

Service discovery

The Kvass coordinator reuses the native Prometheus service discovery code, so its service discovery is 100% compatible with Prometheus. For the targets obtained from service discovery, the coordinator applies the relabel_configs from the user's configuration file and produces the processed targets together with their label sets. The targets obtained from service discovery are then sent to the load detection module.

Load detection

The load detection module receives the processed targets from the service discovery module and, using the scrape settings in the configuration file (such as proxy, certificates, etc.), scrapes each target once and analyses the result to estimate the target's series scale.

The load detection module does not store any scraped metric data; it only records each target's load. Load detection scrapes a target only once and does not track subsequent changes in its load. The load information of long-running targets is maintained by the sidecar, which is introduced in a later section.

Target allocation and scale-out

In the section on the single-machine performance bottleneck we noted that Prometheus memory is related to the number of series, and more precisely to its head series: Prometheus caches in memory the series information of recently collected data (the last 2 hours by default). If we can control the number of head series in each shard, we can effectively control each shard's memory usage, and controlling the head series really means controlling the list of targets the shard is currently collecting.

Based on this idea, the Kvass coordinator periodically manages the target list of each shard: it assigns new targets and removes invalid ones.

In each cycle, the coordinator first obtains the current running state of every shard, including the number of series currently in the shard's memory and the list of targets it is currently scraping. It then processes the global target information obtained from the service discovery module as follows.

  • If a target is already being scraped by a shard, it remains assigned to that shard and the shard's series count is unchanged.
  • If a target is not being scraped by any shard, the coordinator obtains its series count from the load detection module (if it has not been detected yet, it is skipped and retried in the next cycle) and assigns it to a shard whose current in-memory series count, plus the target's series count, is still below the threshold.
  • If none of the current shards can accommodate all the targets to be assigned, the coordinator scales out, and the number of new shards is proportional to the total size of the global series.

Target migration and scale-in

While the system is running, targets may be deleted. If a shard's targets were deleted more than 2 hours ago, that shard's head series shrinks and part of its capacity becomes idle. Because targets are spread across different shards, deleting a large number of targets leaves many shards with very low memory usage. In that case the resource utilization of the system is poor and we need to scale it in.

When this happens, the coordinator migrates targets: the targets of shards with higher ordinals (shards are numbered from 0) are moved to shards with lower ordinals, so that the low-ordinal shards become more loaded and the high-ordinal shards become completely idle. If the storage uses Thanos with data uploaded to object storage (COS), an idle shard is deleted after 2 hours (to ensure its data has already been uploaded).

Multiple replicas

Kvass shards currently only support deployment as StatefulSets.

The coordinator obtains all shard StatefulSets through a label selector. Each StatefulSet is treated as one replica; Pods with the same ordinal across the StatefulSets are considered to belong to the same shard group, Pods in the same shard group are assigned the same targets, and they are expected to carry the same load.

/api/v1/targets interface

As mentioned above, the coordinator performs service discovery based on the configuration file and holds the target list, so it can in principle produce the result set required by the /api/v1/targets interface. However, because the coordinator only does service discovery and does not actually collect the targets, it cannot directly know their collection status (such as health and last scrape time).

When the coordinator receives a /api/v1/targets request, it takes the targets obtained from service discovery, asks the sidecars (for targets that have been assigned) or the load detection module (for targets not yet assigned) for their collection status, and then assembles a correct /api/v1/targets response.

Sidecar

The previous sections introduced the basic functions of the Kvass coordinator. For the system to work, the cooperation of the Kvass sidecar is also required. Its core idea is to replace every service discovery mechanism in the configuration file with static_configs and write the already-relabeled target information directly into the configuration, which removes service discovery and relabeling from the shard itself and makes it collect only its own portion of the targets.

Each shard has one Kvass sidecar. Its core functions are to receive the list of targets this shard is responsible for from the Kvass coordinator and to generate a new configuration file for the shard's Prometheus. In addition, the Kvass sidecar hijacks scrape requests in order to keep the target loads up to date, and it acts as a gateway in front of the Prometheus API to correct some request results.

Configuration file generation

After service discovery, relabeling, and load detection, the coordinator assigns a target to a shard and delivers the target information to the sidecar, including:

  • the address of the target
  • the target's estimated series count
  • the target's hash value
  • the label set obtained after relabeling

The sidecar generates a new configuration file for Prometheus from the target information received from the coordinator, combined with the original configuration file. The new configuration file is modified as follows.

  • All service discovery mechanisms are replaced with static_configs and the target list is written in directly; each target carries its relabeled label values.
  • Since the targets have already been relabeled, relabel_configs is removed from the job configuration, while metric_relabel_configs is retained.
  • The __scheme__ label of every target is replaced with http, and the original scheme is added to the label set in the form of a request parameter.
  • The target's job_name is added to the label set in the form of a request parameter.
  • proxy_url is injected to proxy all scrape requests to the sidecar.


Let's look at an example. Suppose the original configuration is a kubelet scrape configuration:

global:
  evaluation_interval: 30s
  scrape_interval: 15s
scrape_configs:
- job_name: kubelet
  honor_timestamps: true
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - role: node
  bearer_token: xxx
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap

After injection, the following new configuration file is generated:

global:
  evaluation_interval: 30s
  scrape_interval: 15s
scrape_configs:
- job_name: kubelet                                        
  honor_timestamps: true                                                                      
  metrics_path: /metrics                                   
  scheme: https  
  proxy_url: http://127.0.0.1:8008 # all scrape requests are proxied to the sidecar
  static_configs:                                          
  - targets:                                               
    - 111.111.111.111:10250                                   
    labels:                                                
      __address__: 111.111.111.111:10250                      
      __metrics_path__: /metrics                           
      __param__hash: "15696628886240206341"                
      __param__jobName: kubelet                            
      __param__scheme: https  # the original scheme is saved here
      __scheme__: http        # the new scheme; scrape requests proxied to the sidecar are plain HTTP
      # the labels below are the label set produced by relabel_configs
      beta_kubernetes_io_arch: amd64                       
      beta_kubernetes_io_instance_type: QCLOUD             
      beta_kubernetes_io_os: linux                         
      cloud_tencent_com_auto_scaling_group_id: asg-b4pwdxq5
      cloud_tencent_com_node_instance_id: ins-q0toknxf     
      failure_domain_beta_kubernetes_io_region: sh         
      failure_domain_beta_kubernetes_io_zone: "200003"     
      instance: 172.18.1.106                               
      job: kubelet                                         
      kubernetes_io_arch: amd64                            
      kubernetes_io_hostname: 172.18.1.106                 
      kubernetes_io_os: linux

The newly generated configuration file above is the one Prometheus actually uses. The sidecar builds it from the target list issued by the coordinator, so that Prometheus collects only its assigned targets.

Scrape hijacking

In the configuration generation above, we inject proxy_url into the job configuration, and the scheme in the target labels is set to http, so all scrape requests from Prometheus are proxied to the sidecar. The reason is that the sidecar needs to maintain the up-to-date series scale of each target, which the coordinator later uses as a reference for target migration.

From the configuration generation above, we can see that the following additional request parameters are sent to the sidecar.

  • hash: the hash value of the target, used by the sidecar to identify which target a scrape request is for. The hash is computed by the coordinator from the target's label set and passed to the sidecar.
  • jobName: the job the scrape request belongs to, used by the sidecar to issue the real request to the scrape target with that job's request settings from the original configuration file (such as the original proxy_url, certificates, etc.).
  • scheme: the protocol finally obtained for this target after relabeling. Although the job configuration already has a scheme field, a Prometheus configuration can still override the request protocol of an individual target through relabeling. When generating the new configuration, we save the real scheme into this parameter and set every scheme to http.

With the above parameters, the sidecar can issue a correct request to the scrape target and obtain the monitoring data. After computing the series scale of this scrape, the sidecar copies the monitoring data to Prometheus.

API proxy

Because of the sidecar, some API requests sent to Prometheus need special handling, including:

  • /-/reload: since the configuration file actually used by Prometheus is generated by the sidecar, this interface is handled by the sidecar, which calls Prometheus's /-/reload interface after processing succeeds.
  • /api/v1/status/config: this interface is handled by the sidecar, which returns the original configuration file.
  • Other interfaces are forwarded directly to Prometheus.

Global data view

Since the collection targets are scattered across different shards, each shard's data is only part of the global data. We therefore need an additional component to aggregate all the data, de-duplicate it (in the multi-replica case), and provide a global data view.

Take Thanos as an example

Thanos is a very good solution: by adding the Thanos components, we can easily obtain a global data view of the Kvass cluster. Of course, we can also use other TSDB solutions, such as InfluxDB or M3, by adding a remote write configuration.
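
For example, writing each shard's data to another TSDB only requires adding a remote write section to the original configuration file; the endpoint below is purely illustrative and stands for whatever remote-write-compatible backend is used:

remote_write:
- url: http://remote-tsdb-adapter.monitoring:9201/write   # illustrative remote write endpoint
  queue_config:
    max_samples_per_send: 1000                            # batch size, tune as needed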

Usage example

In this section, we walk through a deployment example to get an intuitive feel for the effect of Kvass. The relevant YAML files can be found at https://github.com/tkestack/kvass/tree/master/examples. Readers can clone the project locally and enter the examples directory.

git clone https://github.com/tkestack/kvass.git
cd kvass/examples

Deploy the data generator

We provide a metrics data generator that can produce a specified number of series. In this example, we deploy 6 replicas of the metrics generator, each generating 10045 series (45 of which are Golang metrics).

kubectl create -f  metrics.yaml

Deploy kvass

Now we deploy a Kvass-based Prometheus cluster to collect the metrics of these 6 metrics generators.

First, we deploy the RBAC-related configuration:

kubectl create -f kvass-rbac.yaml

Then we deploy a Prometheus configuration file. This is our original configuration; in it, we use kubernetes_sd for service discovery.

kubectl create -f config.yaml

The configuration is as follows:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: custom
scrape_configs:
- job_name: 'metrics-test'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
    regex: metrics
    action: keep
  - source_labels: [__meta_kubernetes_pod_ip]
    action: replace
    regex: (.*)
    replacement: ${1}:9091
    target_label: __address__
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod

Now let's deploy the Kvass coordinator:

kubectl create -f coordinator.yaml

In the coordinator's startup parameters, we set the maximum number of head series per shard to 30000:

--shard.max-series=30000
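
Inside the coordinator manifest this is an ordinary container argument. An illustrative excerpt is shown below; the actual coordinator.yaml in the examples directory may differ, and the image name here is an assumption:

containers:
- name: kvass-coordinator
  image: tkestack/kvass:latest        # assumed image name, see the examples for the real one
  args:
  - --shard.max-series=30000          # upper bound on head series per shard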

Now we can deploy Prometheus with the Kvass sidecar; here we deploy only a single replica.

kubectl create -f prometheus-rep-0.yaml

Deploy thanos-query

To obtain global data, we need to deploy a thanos-query:

kubectl create -f thanos-query.yaml
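
The key part of such a deployment is pointing thanos-query at every shard's Thanos sidecar and telling it which label distinguishes replicas for de-duplication. An illustrative fragment follows; the store address and replica label are assumptions, and the real thanos-query.yaml in the examples may differ:

containers:
- name: thanos-query
  image: thanosio/thanos:v0.17.2                                # version is illustrative
  args:
  - query
  - --http-address=0.0.0.0:9090
  - --store=dnssrv+_grpc._tcp.prometheus-headless.default.svc   # assumed headless service covering all shards' Thanos sidecars
  - --query.replica-label=prometheus_replica                    # assumed label used to de-duplicate data from multiple replicas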

View Results

According to the calculation above, there are 6 targets and 60270 series in total. Since each shard must not exceed 30000 series, 3 shards are expected.

We can see that the coordinator successfully changed the replica count of the StatefulSet to 3.

Looking at the in-memory series of a single shard, we find that it is scraping only 2 of the targets.

We then use thanos-query to view the global data and find that the data is complete (metrics0 is the name of the metric produced by the metrics generator).

Cloud native monitoring

The Tencent Cloud Container team has further refined the design ideas of Kvass and built a high-performance cloud-native monitoring service that supports multiple clusters. The product is now open for trial.

Large cluster monitoring

In this section, we use the cloud-native monitoring service directly to monitor a large real cluster and test Kvass's ability to monitor large clusters.

Cluster size

The scale of the associated cluster is roughly as follows:

  • 1060 nodes
  • 64,000+ Pods
  • 96,000+ containers

Collection configuration

We use the default collection configuration that the cloud-native monitoring service adds to the associated cluster, which currently covers the mainstream community monitoring components:

  • kube-state-metrics
  • node-exporter
  • kubelet
  • cadvisor
  • kube-apiserver
  • kube-scheduler
  • kube-controller-manager


Test Results


  • 3400+ targets in total, 27+ million series
  • A total of 17 shards after scale-out
  • The series count of each shard is stable below 2 million
  • Each shard consumes about 6-10 GB of memory

The default Grafana dashboards provided by the cloud-native monitoring service also load normally.

The targets list can also be retrieved normally.

Multi-cluster monitoring

It is worth mentioning that the cloud-native monitoring service supports not only monitoring a single large cluster but also monitoring multiple clusters with one instance, and it provides collection and alerting templates that can be distributed to the clusters of every region with one click, completely eliminating the need to add the same configuration to each cluster repeatedly.

Summary

This article has introduced an open-source Prometheus clustering technology in detail, from problem analysis and design goals to principles and usage examples. It supports horizontal scaling without modifying the Prometheus code, making it possible to monitor large-scale clusters that a single Prometheus cannot handle.
