Which Kubernetes health metrics should be monitored

 

This article was translated and published in the Kubernetes Chinese community.

Table of Contents

1. Resource and utilization metrics

2. Status metrics

3. Control plane metrics

4. Control plane health

Whether the etcd cluster has a leader

API request latency

Work queue latency

Scheduler problems

5. Events

6. Application metrics

Summary


In a recent Circonus survey of Kubernetes operators, deciding which health metrics to collect was one of the biggest challenges operators face. Considering that Kubernetes can generate millions of metrics every day, this is not surprising.

In this article, we will share which health metrics are the most critical for Kubernetes operators to monitor.

1. Resource and utilization metrics

Resource and utilization metrics come from Kubernetes' built-in Metrics API, supplied by the kubelet on each node. Most of the time we look only at CPU usage as a health signal, but memory usage and network traffic are just as important to monitor.

Indicator | Metric name | Description
CPU usage | usageNanoCores | CPU used by the node or Pod, in nanocores per second.
CPU capacity | capacity_cpu | The number of CPU cores available on the node (not applicable to Pods).
Memory usage | used{resource:memory,units:bytes} | Memory used by the node or Pod, in bytes.
Memory capacity | capacity_memory{units:bytes} | Memory capacity available on the node, in bytes (not applicable to Pods).
Network traffic | rx{resource:network,units:bytes}, tx{resource:network,units:bytes} | Total network traffic seen by the node or Pod, both received (incoming) and transmitted (outgoing), in bytes.

CPU usage is an important health metric, and the easiest to understand: you should track how much CPU your nodes are using, for two reasons. First, you don't want to run out of processing resources for your application; if your application is CPU-bound, you need to increase its CPU allocation or add more nodes to the cluster. Second, you don't want CPUs sitting idle, since sustained idle capacity is wasted spend.
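As an illustration of both concerns, here is a minimal Python sketch that converts a usageNanoCores reading into a utilization percentage and flags both saturation and waste. The function names are invented for this example, and the 20%/80% thresholds are placeholders, not recommendations:

```python
def cpu_utilization_pct(usage_nano_cores: float, capacity_cores: int) -> float:
    """Convert a usageNanoCores reading into a percentage of node capacity.

    usageNanoCores reports CPU consumed per second in nanocores;
    one full core corresponds to 1e9 nanocores.
    """
    used_cores = usage_nano_cores / 1e9
    return 100.0 * used_cores / capacity_cores


def cpu_alert(usage_nano_cores: float, capacity_cores: int,
              low: float = 20.0, high: float = 80.0) -> str:
    """Flag both exhaustion (above `high`) and waste (below `low`),
    mirroring the two concerns described above. Thresholds are illustrative."""
    pct = cpu_utilization_pct(usage_nano_cores, capacity_cores)
    if pct > high:
        return "cpu-saturated"
    if pct < low:
        return "cpu-idle"
    return "ok"
```

In practice you would feed this from the kubelet's Summary/Metrics API per node and alert on sustained readings rather than single samples.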

2. Status metrics

Kube-state-metrics is a component that provides data about the state of cluster objects (nodes, Pods, DaemonSets, namespaces, and so on).

Indicator | Metric name | Description
Node status | kube_node_status_condition{status:true,condition:OutOfDisk|MemoryPressure|PIDPressure|DiskPressure|NetworkUnavailable} | When the value is 1 with status "true", the node is currently experiencing the named condition.
Crash loops | kube_pod_container_status_waiting_reason{reason:CrashLoopBackOff} | Whether a container in the Pod is crash-looping.
Job status (failed) | kube_job_status_failed | Whether the job has failed.
Persistent volume status (failed) | kube_persistentvolume_status_phase{phase:Failed} | Whether the persistent volume has failed.
Pod status (Pending) | kube_pod_status_phase{phase:Pending} | Whether the Pod is stuck in the Pending state.
Deployment generation | kube_deployment_metadata_generation | The Deployment's generation sequence number.
Deployment observed generation | kube_deployment_status_observed_generation | The Deployment generation currently observed by the controller.
DaemonSet desired nodes | kube_daemonset_status_desired_number_scheduled | The number of nodes the DaemonSet should be running on.
DaemonSet current nodes | kube_daemonset_status_current_number_scheduled | The number of nodes the DaemonSet is currently running on.
StatefulSet desired replicas | kube_statefulset_status_replicas | The desired number of replicas for each StatefulSet.
StatefulSet ready replicas | kube_statefulset_status_replicas_ready | The number of ready replicas for each StatefulSet.

Using these metrics, you should monitor and alert on the following: crash loops, disk pressure, memory pressure, PID pressure, network unavailability, job failures, persistent volume failures, Pods stuck in Pending, Deployment generation mismatches, DaemonSets that are not ready, and StatefulSets that are not ready.
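To make that alert list concrete, here is a hedged sketch of evaluating kube-state-metrics-style samples against a few of these conditions. The sample format and function names are invented for illustration; in practice the values would come from scraping the kube-state-metrics endpoint:

```python
# Each sample is (metric_name, labels_dict, value), a simplified stand-in
# for what a scrape of kube-state-metrics would yield.
ALERT_CONDITIONS = {
    "kube_node_status_condition": lambda l, v: v == 1
        and l.get("status") == "true"
        and l.get("condition") in {"OutOfDisk", "MemoryPressure", "PIDPressure",
                                   "DiskPressure", "NetworkUnavailable"},
    "kube_pod_container_status_waiting_reason":
        lambda l, v: v == 1 and l.get("reason") == "CrashLoopBackOff",
    "kube_job_status_failed": lambda l, v: v > 0,
    "kube_persistentvolume_status_phase":
        lambda l, v: v == 1 and l.get("phase") == "Failed",
    "kube_pod_status_phase":
        lambda l, v: v == 1 and l.get("phase") == "Pending",
}


def firing_alerts(samples):
    """Return (metric_name, labels) for every sample matching an alert rule."""
    return [(name, labels) for name, labels, value in samples
            if name in ALERT_CONDITIONS and ALERT_CONDITIONS[name](labels, value)]
```

A real setup would express these as alerting rules in the monitoring system rather than ad hoc code, but the matching logic is the same.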

3. Control plane metrics

The Kubernetes control plane comprises the "system components" that manage the cluster. In a managed environment such as those offered by Google or Amazon, the control plane is operated by the cloud provider and you usually do not have to worry about monitoring it. If you run your own cluster, however, you need to understand how to monitor the control plane.

Indicator | Metric name | Description
etcd has a leader | etcd_server_has_leader | Whether the etcd member knows who its leader is.
etcd leader changes | etcd_server_leader_changes_seen_total | The total number of leader changes seen within the etcd cluster.
API request count | apiserver_request_latencies_count | The total number of API requests; used to calculate the average latency of each request.
API latency sum | apiserver_request_latencies_sum | The total duration of all API requests; used to calculate the average latency of each request.
Work queue wait time | workqueue_queue_duration_seconds | The total time spent waiting in each controller manager work queue.
Work queue processing time | workqueue_work_duration_seconds | The total time spent processing items from each controller manager work queue.
Unschedulable Pods | scheduler_schedule_attempts_total{result:unschedulable} | The total number of scheduling attempts that failed because the Pod was unschedulable.
Pod scheduling latency | scheduler_e2e_scheduling_delay_microseconds (< v1.14) or scheduler_e2e_scheduling_duration_seconds | The total time taken to schedule Pods onto nodes.

4. Control plane health

You should monitor the following health conditions on the control plane:

Whether the etcd cluster has a leader

An etcd cluster should always have a leader (except briefly while the leader is changing, which should be rare). You should keep an eye on the etcd_server_has_leader metric for every member, because cluster performance degrades when members have no leader. Also, if etcd_server_leader_changes_seen_total shows frequent leader changes, it may indicate connectivity or resource problems in the etcd cluster.
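A minimal sketch of both etcd checks, assuming you have already scraped per-member values of etcd_server_has_leader and two successive readings of etcd_server_leader_changes_seen_total; the change threshold of 3 per interval is illustrative, not a recommendation:

```python
def etcd_health(has_leader_by_member, changes_total_now, changes_total_prev):
    """Flag the two etcd problems discussed above: members reporting
    etcd_server_has_leader == 0, and a burst of leader changes computed
    as the delta of etcd_server_leader_changes_seen_total."""
    problems = []
    leaderless = [m for m, v in has_leader_by_member.items() if v == 0]
    if leaderless:
        problems.append(("no-leader", leaderless))
    change_delta = changes_total_now - changes_total_prev
    if change_delta > 3:  # illustrative threshold per scrape interval
        problems.append(("frequent-leader-changes", change_delta))
    return problems
```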

API request latency

Dividing apiserver_request_latencies_sum by apiserver_request_latencies_count gives the average latency of each API server request. Tracking how this average changes over time tells you when the server is becoming overwhelmed.
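Since both metrics are cumulative counters, the average is usually computed over the deltas between two scrapes rather than over the counters' lifetime totals. A small sketch (the function name is my own):

```python
def avg_request_latency(count_prev, count_now, sum_prev, sum_now):
    """Average per-request API latency over one scrape interval, from
    deltas of apiserver_request_latencies_count and ..._sum. Both are
    monotonically increasing counters, so dividing the deltas yields
    the mean latency of requests that arrived during the interval."""
    requests = count_now - count_prev
    if requests <= 0:
        return 0.0  # no new requests (or a counter reset) in the interval
    return (sum_now - sum_prev) / requests
```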

Work queue latency

The work queues are managed by the controller manager and handle all of the cluster's automated processes. Watching for increases in workqueue_queue_duration_seconds tells you when queue latency is growing. If that happens, you may want to dig into the controller manager logs to see what is going on.
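One illustrative way to detect a sustained rise, given successive scrapes of the cumulative workqueue_queue_duration_seconds counter; the detection rule here is a simple placeholder, not a recommended algorithm:

```python
def queue_latency_rising(samples, window=3):
    """Detect a sustained rise in queue latency. The metric is cumulative,
    so convert successive scrapes into per-interval deltas, then flag when
    the last `window` deltas are strictly increasing."""
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    recent = deltas[-window:]
    return len(recent) == window and all(b > a for a, b in zip(recent, recent[1:]))
```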

Scheduler problems

Two aspects of the scheduler are worth watching. First, monitor scheduler_schedule_attempts_total{result:unschedulable}, because an increase in unschedulable Pods can mean your cluster has a resource problem. Second, monitor scheduler latency using one of the latency metrics listed above. An increase in Pod scheduling latency can cause other problems, and may also indicate resource issues in the cluster.
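Both scheduler checks can be sketched as one helper. The inputs are assumed to be two successive readings of the unschedulable-attempts counter plus an average scheduling latency derived from the metrics above; the one-second threshold is purely illustrative:

```python
def scheduler_warnings(unschedulable_prev, unschedulable_now,
                       avg_scheduling_seconds, latency_threshold=1.0):
    """The two scheduler checks described above: growth in
    scheduler_schedule_attempts_total{result="unschedulable"} between
    scrapes, and elevated average scheduling latency."""
    warnings = []
    if unschedulable_now > unschedulable_prev:
        warnings.append("unschedulable-pods-increasing")
    if avg_scheduling_seconds > latency_threshold:  # illustrative threshold
        warnings.append("scheduling-latency-high")
    return warnings
```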

5. Events

Besides collecting numeric metrics from your Kubernetes cluster, it is also useful to collect and track events. Cluster events let you monitor the Pod lifecycle and spot significant Pod failures, and the rate at which events flow out of the cluster can be a great early-warning signal: if the event rate changes suddenly and significantly, it may be a sign that something is going wrong.
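As a toy example of the early-warning idea, the sketch below compares the latest per-minute event count against the average of the preceding window; the 3x change factor is an arbitrary illustration, not a tuned threshold:

```python
def event_rate_anomaly(counts_per_minute, factor=3.0):
    """Flag a sudden change in the cluster event rate: compare the most
    recent per-minute event count against the mean of earlier samples,
    alerting when the rate jumps or drops by more than `factor`."""
    if len(counts_per_minute) < 2:
        return False
    *history, latest = counts_per_minute
    baseline = sum(history) / len(history)
    if baseline == 0:
        return latest > 0
    ratio = latest / baseline
    return ratio > factor or ratio < 1 / factor
```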

6. Application metrics

Unlike the metrics and events examined above, application metrics are not emitted by Kubernetes itself but by the workloads the cluster runs. From the application's perspective, this can be anything you consider important: error responses, request latency, processing time, and so on.

There are two schools of thought about how to collect application metrics. The first holds that metric data should be "pushed" from the application to a collection endpoint. This means bundling a client such as StatsD with every application to provide a mechanism for pushing metrics out of it. This technique carries more management overhead, because you must ensure that every application running in the cluster is correctly instrumented, so it has fallen out of favor with cluster operators.

The second, increasingly widely adopted, school is that metrics should be "pulled" from the application by a collection agent. This makes applications simpler to write: all they have to do is publish their metrics appropriately, without worrying about how those metrics are scraped. This is how OpenMetrics works, and it is how metrics are collected in Kubernetes clusters. Combined with service discovery in the collection agent, it becomes a powerful way to collect whatever metrics you need from the applications in your cluster.
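To illustrate the pull model, the sketch below renders application metrics in the Prometheus/OpenMetrics-style text format that a scraping agent reads from a /metrics endpoint. The helper and its input shape are invented for illustration; real applications would normally use an official client library instead:

```python
def render_exposition(metrics):
    """Render metrics in the text exposition format a pull-based agent
    scrapes. `metrics` maps (name, frozenset of (label, value) pairs)
    to a numeric sample value."""
    lines = []
    for (name, labels), value in sorted(metrics.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

The application's only job is to serve this text over HTTP; the agent discovers the endpoint and decides when to pull.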

Summary

Kubernetes can generate millions of new metrics every day, which presents two big challenges. First, many conventional monitoring systems simply cannot keep up with the volume of metrics a Kubernetes cluster produces. Second, the sheer sprawl of the data makes it hard to keep up and to know which metrics matter most.

Your Kubernetes monitoring solution therefore needs to handle all of this data while automatically analyzing, graphing, and alerting on the most critical metrics. That way you know you are collecting everything you expect, filtering out what you do not need, and automatically narrowing in on the relevant data, which saves significant time and helps keep everything running smoothly.

Original article: https://thenewstack.io/which-kubernetes-health-metrics-should-be-monitored/


Source: blog.csdn.net/fly910905/article/details/112467716