Master Prometheus and Grafana from 0 to 1 at the speed of light

Author

Huang Lei is a senior engineer at Tencent Cloud. He was responsible for building the new generation of multi-dimensional business monitoring system for Tencent Cloud's Cloud Monitoring, is experienced in designing large-scale distributed monitoring systems, and has a deep understanding of Go backend project architecture. He later joined the TKE team, where he focuses on Kubernetes operation and maintenance technology and has many years of experience operating and managing federated Kubernetes clusters. His team is currently responsible mainly for improving the observability of large-scale cluster federations, and he has led the development of Tencent Cloud's 10,000-scale Kubernetes cluster monitoring and alerting system and its intelligent inspection and risk detection system.

Summary

If you ask me which open source components are sure to be used when managing Kubernetes clusters, I think Prometheus is one of them. Prometheus offers strong performance, an active ecosystem, convenient deployment, and the flexible PromQL query language, which makes it especially well suited to collecting and aggregating monitoring data at every level of a Kubernetes environment: master, node, and application. Combined with Grafana's eye-catching dashboards (as shown in the figure below), it can fairly be called the best solution for cloud native monitoring.

Powerful as Prometheus and Grafana are, there is a real learning cost when you first encounter them, and getting started is not easy. I feel this keenly. A few years ago, before I was responsible for improving my team's cloud native observability, a friend who was new to Prometheus complained to me all day long: "Why is the Prometheus syntax so complicated?" and "This thing is awful, how am I supposed to write this?" At the time I laughed at him for exaggerating, but once I started learning Prometheus myself and building Grafana panels, I made exactly the same complaints about statements such as the following.

 # Deployments: number of unavailable replicas
 max(
   label_replace(
     label_replace(
       label_replace(
         kube_deployment_status_replicas_unavailable,
         "workload_kind", "Deployment", "", ""),
       "workload_name", "$1", "deployment", "(.*)"),
     "__name__", "k8s_workload_abnormal", "__name__", "(.*)")
 ) by (namespace, workload_name, workload_kind, __name__)
 or on (namespace, workload_name, workload_kind, __name__)
 # DaemonSets: number of unavailable pods
 max(
   label_replace(
     label_replace(
       label_replace(
         kube_daemonset_status_number_unavailable,
         "workload_kind", "DaemonSet", "", ""),
       "workload_name", "$1", "daemonset", "(.*)"),
     "__name__", "k8s_workload_abnormal", "__name__", "(.*)")
 ) by (namespace, workload_name, workload_kind, __name__)
 or on (namespace, workload_name, workload_kind, __name__)
 # StatefulSets: desired replicas minus ready replicas
 max(
   label_replace(
     label_replace(
       label_replace(
         (kube_statefulset_replicas - kube_statefulset_status_replicas_ready),
         "workload_kind", "StatefulSet", "", ""),
       "workload_name", "$1", "statefulset", "(.*)"),
     "__name__", "k8s_workload_abnormal", "__name__", "(.*)")
 ) by (namespace, workload_name, workload_kind, __name__)
 or on (namespace, workload_name, workload_kind, __name__)
 # Jobs: number of failed pods
 max(
   label_replace(
     label_replace(
       label_replace(
         kube_job_status_failed,
         "workload_kind", "Job", "", ""),
       "workload_name", "$1", "job_name", "(.*)"),
     "__name__", "k8s_workload_abnormal", "__name__", "(.*)")
 ) by (namespace, workload_name, workload_kind, __name__)
 or on (namespace, workload_name, workload_kind, __name__)
 # CronJobs: multiplied by 0, so they are listed but always report 0
 max(
   label_replace(
     label_replace(
       label_replace(
         (kube_cronjob_info * 0),
         "workload_kind", "CronJob", "", ""),
       "workload_name", "$1", "cronjob", "(.*)"),
     "__name__", "k8s_workload_abnormal", "__name__", "(.*)")
 ) by (namespace, workload_name, workload_kind, __name__)

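To make the nested label_replace calls in the statement above easier to follow, here is a minimal Python sketch that mimics what label_replace does to the labels of a single series. It is an illustration only, not Prometheus code: it assumes Python's `re` behaves like Prometheus's RE2 for these simple patterns, it only expands `$1`-style references, and the sample label values (`default`, `web`) are made up.

```python
import re


def label_replace(labels, dst, replacement, src, regex):
    """Mimic PromQL label_replace on one series' label set (a dict).

    The regex must match the whole value of `src`; a missing label counts
    as the empty string. On a match, `dst` is set to `replacement` with
    $1, $2, ... expanded. Otherwise the labels are returned unchanged.
    """
    match = re.fullmatch(regex, labels.get(src, ""))
    if match is None:
        return dict(labels)
    value = re.sub(r"\$(\d+)", lambda m: match.group(int(m.group(1))), replacement)
    result = dict(labels)
    if value:
        result[dst] = value
    else:
        result.pop(dst, None)  # an empty value removes the label
    return result


# Hypothetical series as kube-state-metrics might report it.
series = {
    "__name__": "kube_deployment_status_replicas_unavailable",
    "namespace": "default",
    "deployment": "web",
}

# Innermost call: empty source and empty regex always match, so this
# simply attaches a constant label.
step1 = label_replace(series, "workload_kind", "Deployment", "", "")
# Middle call: copy the value of "deployment" into "workload_name".
step2 = label_replace(step1, "workload_name", "$1", "deployment", "(.*)")
# Outermost call: overwrite the metric name with a shared one.
step3 = label_replace(step2, "__name__", "k8s_workload_abnormal", "__name__", "(.*)")

print(step3)
```

After the three calls the series carries `workload_kind`, `workload_name`, and the shared name `k8s_workload_abnormal`. Giving every workload type the same label set in this way is what allows the five branches to be combined with `or on (...)`.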
Over the past few years I have accumulated some practical experience with Prometheus and have fallen into plenty of pitfalls along the way.

To help readers who want to learn Prometheus get started faster, avoid detours, and sharpen their business monitoring skills in the cloud native era, I have compiled a tutorial. It covers the most basic and central concepts, techniques, and best practices, so that you can master the most commonly used 80% in 20% of the time.

The tutorial shows, from scratch, how to expose monitoring metrics for your business, how to configure service discovery correctly, and how to build a practical Grafana panel, so that readers can get started with Prometheus + Grafana at the speed of light and learn the right way to do cloud native monitoring.
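As a small taste of the first topic, exposing business metrics, the sketch below serves one counter in the Prometheus text exposition format using only the Python standard library. It is not taken from the tutorial: the metric name `shop_orders_total`, the `status` label, the sample counts, and port 8000 are all hypothetical, and a real service would more likely use an official Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical business counters; a real service would update these as it runs.
ORDERS_BY_STATUS = {"paid": 12, "failed": 3}


def render_metrics(counters):
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP shop_orders_total Orders processed, by final status.",
        "# TYPE shop_orders_total counter",
    ]
    for status, value in sorted(counters.items()):
        lines.append(f'shop_orders_total{{status="{status}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the metrics at /metrics, the path Prometheus scrapes by default."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(ORDERS_BY_STATUS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the HTTP server; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a Prometheus scrape job at port 8000 of the host would be enough to collect the counter; service discovery then decides how Prometheus finds such endpoints automatically.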

To get the tutorial, send the message "Prometheus" or "Light Speed Introduction" to the "Tencent Cloud Native" official account. Let's learn together!

Tip: The tutorial is currently available as a website version (which needs to be opened in a browser) and a PDF version; choose whichever suits you. The website version will continue to be updated, so keep an eye on it.

At the same time, you are welcome to submit issues to the tutorial. This tutorial will be updated, expanded and revised from time to time based on your feedback!

(The GitHub address of the issue is mentioned)

The tutorial's contents are as follows

