3 key configuration challenges for monitoring Kubernetes with Prometheus

This article was originally published in the Kubernetes Chinese community and was translated by the author.


Table of Contents

Challenge 1: Manually configure the application

Solution: Use GitOps to maintain control

Challenge 2: Manually create configuration files and dashboards

Solution: Code Generator

Challenge 3: Configure synchronization

Solution: Use abstraction to achieve reuse and keep the generated files in sync


Observability is essential for running a large number of workloads in a Kubernetes cluster. Prometheus is a monitoring system and time series database that has proven well suited to large-scale, dynamic Kubernetes environments. In fact, Prometheus is considered a basic building block for running applications on Kubernetes and has become the de facto standard for visibility and monitoring in Kubernetes environments.

Although Prometheus is open source, the configuration required to monitor Kubernetes workloads does not come for free.

Note: In this article, I will not discuss the challenges of running Prometheus in multi-cluster, high-availability setups. Instead, I focus on how to extend Prometheus to more applications and create dashboards for each application so that more people can use it. If you are interested in high-availability setups, you can refer to projects such as Thanos or VictoriaMetrics.

To start using Prometheus, you can configure scraping to extract metrics from the service, use Grafana to build dashboards, and define alerts for important metrics that exceed thresholds in the production environment (see the image below).

Once you have chosen Prometheus, the first challenge is scaling and managing it across all of your applications and environments.

img

Challenge 1: Manually configure the application

Modern software workloads usually consist of hundreds or thousands of microservices. These are not just multiple instances of the same application, but also different smaller applications that communicate with each other, all orchestrated by Kubernetes. These workloads do not run in a single cluster or a single environment, but are distributed across multiple clusters and environments (such as development, testing, and production).

For example, as of the end of 2019, Uber's workload had grown to more than 4,000 microservices. Managing and operating applications this complex requires advanced observability, which in turn requires specific scraping, dashboard and alert configuration for each application. Not only must you create these configurations, you must also apply them to each environment. Moreover, every change to them is often made by hand.

Problem: Managing the configuration in the Prometheus and Grafana ecosystem takes a lot of manual labor.

Solution: Use GitOps to maintain control

Instead of ad hoc application configuration, you can use the "GitOps" approach: a Git repository holds all configuration, documentation and code, and an operator component automatically applies it to the systems being managed, such as Prometheus, Grafana, or even a Kubernetes cluster.

Do not make any changes directly to the Prometheus configuration or the Grafana dashboards. Instead, every change must first be committed to the Git repository and is then synchronized to Prometheus, Grafana or other tools.

One of the many benefits of the GitOps approach is that all configuration is under version control, so you can identify when and why each change occurred and easily roll back problematic changes. You can also use pull requests to review and improve the configuration.

The figure below shows a Git repository and operator to manage all configuration files. The operator must have the logic and authority to configure the underlying system.

img

Comparison of manual application configuration and GitOps method
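
To make the operator's role in the GitOps approach described above more concrete, here is a minimal Python sketch of the comparison step such an operator might perform: it compares the configuration stored in Git (the desired state) with what was last applied (the live state) and works out what to create, update or delete. The file names, the fingerprinting scheme and the function names are assumptions made for illustration; they are not part of any particular GitOps tool.

```python
from __future__ import annotations

import hashlib


def fingerprint(content: str) -> str:
    """Return a stable hash used to detect whether a config file changed."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def plan_sync(desired: dict[str, str], applied: dict[str, str]) -> dict[str, list[str]]:
    """Compare the Git state (desired) with the live state (applied).

    Both arguments map a config name to the fingerprint of its content.
    """
    return {
        "create": sorted(n for n in desired if n not in applied),
        "update": sorted(n for n in desired if n in applied and desired[n] != applied[n]),
        "delete": sorted(n for n in applied if n not in desired),
    }


# Hypothetical repository contents: file name -> file body.
repo = {
    "prometheus/alerts/checkout.yaml": "groups: []",
    "grafana/dashboards/checkout.json": "{}",
}
desired = {name: fingerprint(body) for name, body in repo.items()}

# Hypothetical live state recorded by the operator after its last sync.
applied = {
    "prometheus/alerts/checkout.yaml": fingerprint("groups: [] # old"),
    "grafana/dashboards/legacy.json": fingerprint("{}"),
}

plan = plan_sync(desired, applied)
print(plan)
```

A real operator would follow this plan with calls to the target systems and would need the credentials to do so; the sketch only covers the decision of what has to change.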

Challenge 2: Manually create configuration files and dashboards

Putting the settings under version control and keeping all configuration in Git is the first step. But there is still a lot of manual configuration to deal with.

Learning to write PromQL queries for Prometheus is not an easy task. In addition to PromQL, you also need Grafana dashboard configuration (written in JSON) to get a full picture of your application, and alert rules (written in YAML) in Prometheus to be notified of failures.

Problem: You need a team of engineers fluent in several configuration languages to write and maintain all of this configuration by hand.

Solution: Code Generator

Code generators to the rescue! Instead of manually writing configurations for Prometheus, Alertmanager and Grafana dashboards, you can use a code generator to ease the manual work.

A good example is Prometheus alerting and recording rules generated from SRE concepts such as the Golden Signals, the RED method, or even the USE method, which are widely regarded as the most useful and critical metrics. Another example is the generation of Grafana dashboards (see, for example, uber/grafana-dash-gen, metalmatze/slo-libsonnet and prometheus-operator/kube-prometheus on GitHub, and Scripted Dashboards on the Grafana Labs website).

Using a code generator can speed up the configuration work. The generated files are stored in the Git repository to get all the benefits I discussed earlier.

The figure below compares the manual configuration with the code generated configuration, and shows how the latter method can accomplish the heavy lifting and reduce user errors.

img

Writing configuration manually versus using a code generator
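
As a rough illustration of the code generator idea, the following Python sketch produces a Prometheus rule group with two RED-style alerts (error ratio and latency) for a given service. The metric names http_requests_total and http_request_duration_seconds_bucket, the service and code labels, the alert names and the thresholds are all assumptions; your applications may expose different metrics. The output is serialized as JSON only to stay within the standard library, whereas a real generator would more likely emit YAML.

```python
import json


def red_alert_rules(service: str, max_error_ratio: float, max_p99_seconds: float) -> dict:
    """Build a Prometheus rule group with RED-style alerts for one service."""
    selector = f'service="{service}"'
    # Errors: share of requests answered with a 5xx status code.
    error_expr = (
        f'sum(rate(http_requests_total{{{selector},code=~"5.."}}[5m]))'
        f" / sum(rate(http_requests_total{{{selector}}}[5m]))"
        f" > {max_error_ratio}"
    )
    # Duration: 99th percentile latency computed from histogram buckets.
    latency_expr = (
        "histogram_quantile(0.99, sum by (le) "
        f"(rate(http_request_duration_seconds_bucket{{{selector}}}[5m])))"
        f" > {max_p99_seconds}"
    )
    return {
        "groups": [
            {
                "name": f"{service}-red",
                "rules": [
                    {
                        "alert": "HighErrorRatio",
                        "expr": error_expr,
                        "for": "10m",
                        "labels": {"severity": "page", "service": service},
                        "annotations": {
                            "summary": f"{service} error ratio above {max_error_ratio}"
                        },
                    },
                    {
                        "alert": "HighLatencyP99",
                        "expr": latency_expr,
                        "for": "10m",
                        "labels": {"severity": "ticket", "service": service},
                        "annotations": {
                            "summary": f"{service} p99 latency above {max_p99_seconds}s"
                        },
                    },
                ],
            }
        ]
    }


# Generate the rules for a hypothetical "checkout" service.
rules = red_alert_rules("checkout", max_error_ratio=0.01, max_p99_seconds=0.5)
print(json.dumps(rules, indent=2))
```

Calling the same function for every service yields consistent rules without anyone writing PromQL by hand, and the generated files can be committed to the Git repository like any other configuration.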

Challenge 3: Configure synchronization

Once you start using code generators, you end up with a large number of automatically generated configuration files. The configurations stored in the Git repository are independent of each other, and no control mechanism ensures that they are based on the same input. In fact, this is not even possible, because the code generators may rely on different kinds of input.

For example, changing the input parameters of code generator 1 changes its output, but the output of code generators 2 and 3 does not follow. As a result, the generated files drift out of sync.

Only a few solutions address this problem, such as prometheus-operator/kube-prometheus.

Problem: Every input must be changed manually to produce a new version of the generated configuration files.

Solution: Use abstraction to achieve reuse and keep the generated files in sync

Abstraction, a standard technique in software engineering, enables reuse and can help overcome the challenge of configuration files drifting out of sync. Introducing an intermediate language built around SRE (Site Reliability Engineering) concepts can provide the technical foundation.

The following figure shows how introducing an intermediate language such as Jsonnet lets you define common concepts once and generate specific configuration files for different platforms such as Prometheus and Grafana. With such a high-level language you can abstract away implementation details. The language you use must, however, provide all the concepts that are common in the Prometheus monitoring domain.

A well-established SRE concept is the service level objective (SLO), which lets you define goals for each microservice. Using machine- and human-readable code (such as YAML files), you can generate configurations for multiple tools and make all of them conform to the defined service level objectives. This reduces complexity and makes it easier to operate and scale your Prometheus environment.

img

Comparison of the approach without abstraction and the new approach with abstraction based on SRE concepts
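
The sketch below shows the idea in Python rather than Jsonnet: a single SLO definition is the only input, and both the Prometheus alert rule and the Grafana panel are derived from it, so they cannot drift apart. The Slo class, the metric name http_requests_total, its labels, and the heavily simplified rule and panel structures are assumptions for illustration. In particular, the alert is a plain threshold on the error ratio, not a production-grade burn-rate alert.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Single source of truth for one service's availability objective."""

    service: str
    availability_target: float  # e.g. 0.999 means 99.9% of requests succeed

    @property
    def error_budget(self) -> float:
        # Rounded to avoid floating-point noise in the generated text.
        return round(1.0 - self.availability_target, 6)

    def error_ratio_query(self) -> str:
        """PromQL shared by every generated artifact."""
        sel = f'service="{self.service}"'
        return (
            f'sum(rate(http_requests_total{{{sel},code=~"5.."}}[5m]))'
            f" / sum(rate(http_requests_total{{{sel}}}[5m]))"
        )


def to_prometheus_rule(slo: Slo) -> dict:
    """Derive a (simplified) alerting rule from the SLO."""
    return {
        "alert": "ErrorBudgetExceeded",
        "expr": f"{slo.error_ratio_query()} > {slo.error_budget}",
        "labels": {"service": slo.service},
    }


def to_grafana_panel(slo: Slo) -> dict:
    """Derive a (simplified) dashboard panel from the same SLO."""
    return {
        "type": "timeseries",
        "title": f"{slo.service} error ratio (budget {slo.error_budget})",
        "targets": [{"expr": slo.error_ratio_query()}],
    }


# One change to the SLO updates both generated artifacts together.
slo = Slo(service="checkout", availability_target=0.999)
rule = to_prometheus_rule(slo)
panel = to_grafana_panel(slo)
print(rule["expr"])
print(panel["title"])
```

Because both outputs are computed from the same Slo object, tightening the target in one place changes the alert threshold and the dashboard title in the same generation run.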

Translation link: https://thenewstack.io/3-key-configuration-challenges-for-kubernetes-monitoring-with-prometheus/


Origin blog.csdn.net/fly910905/article/details/109253881