kube-state-metrics metric data
1. About kube-state-metrics
kube-state-metrics focuses on the latest state of Kubernetes resource objects, such as Deployments or DaemonSets. It is not folded into metrics-server because the two have fundamentally different goals. metrics-server only fetches existing data, formats it, and writes it to a specific storage; it is essentially a monitoring system. kube-state-metrics, by contrast, holds a snapshot of the cluster's running state in memory and generates new metrics from it, but it has no ability to export those metrics anywhere on its own; a collector such as Prometheus has to scrape them.
2. Monitoring the number of nodes
PromQL statement:
kube_node_info{instance="10.42.4.65:8080"}
Note: This returns one series per node in the K8S cluster, so you can see how many nodes the cluster has, alert when that differs from the expected number of nodes, or show it on a dashboard
3. Cluster node status is abnormal
PromQL statement:
kube_node_status_condition{condition="Ready",status!="true"} == 1
Description: Monitors whether a cluster node is in an abnormal state. A result with value 1 means the node's Ready condition is false or unknown, and an alert can be raised
4. Is the cluster node ready?
PromQL statement:
kube_node_status_condition{condition="Ready",status="true"} == 0
Description: Monitors whether a cluster node is ready, similar to the STATUS column of kubectl get node. A returned series means the node is not Ready
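The node queries in sections 3 and 4 rely on how a condition is encoded: each node has one series per status value (true, false, unknown), and exactly one of them carries the value 1. The Python sketch below mimics that encoding to show why both queries pick out the same nodes. The sample data, node names, and helper functions are invented for illustration and are not part of any library.

```python
# Hypothetical samples of kube_node_status_condition{condition="Ready"}.
# Each node has one series per status; exactly one of them has value 1.
SAMPLES = [
    {"node": "node-a", "status": "true", "value": 1},
    {"node": "node-a", "status": "false", "value": 0},
    {"node": "node-a", "status": "unknown", "value": 0},
    {"node": "node-b", "status": "true", "value": 0},
    {"node": "node-b", "status": "false", "value": 0},
    {"node": "node-b", "status": "unknown", "value": 1},
]


def not_ready_by_status(samples):
    """Mimics {condition="Ready",status!="true"} == 1."""
    return sorted({s["node"] for s in samples
                   if s["status"] != "true" and s["value"] == 1})


def not_ready_by_value(samples):
    """Mimics {condition="Ready",status="true"} == 0."""
    return sorted({s["node"] for s in samples
                   if s["status"] == "true" and s["value"] == 0})


print(not_ready_by_status(SAMPLES))  # ['node-b']
print(not_ready_by_value(SAMPLES))   # ['node-b']
```

The first form also exposes which status (false or unknown) the node is in, which can be useful in an alert message.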
4.1. Is a cluster node short of memory or disk resources?
PromQL statement:
kube_node_status_condition{condition=~"OutOfDisk|MemoryPressure|DiskPressure",status!="false"} == 1
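5. Monitoring failed PVCs in the cluster
PromQL statement:
kube_persistentvolumeclaim_status_phase{phase="Lost"} == 1
Description: A PVC is in one of the phases Pending, Bound, or Lost. Lost, meaning the claim has lost its underlying volume, is the failure state to alert on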
6. Monitoring Pods that failed to start in the cluster
PromQL statement:
kube_pod_status_phase{phase=~"Failed|Unknown"} == 1
7. Monitoring Pod container restarts in the last 30 minutes
PromQL statement:
changes(kube_pod_container_status_restarts_total[30m])
8. Total CPU requested by containers (unit: cores)
PromQL statement:
sum(kube_pod_container_resource_requests_cpu_cores)
Description: Monitors the total number of CPU cores requested by containers
9. Total CPU limit of containers (unit: cores)
PromQL statement:
sum(kube_pod_container_resource_limits_cpu_cores)
10. Total CPU capacity of the nodes (unit: cores)
PromQL statement:
sum(kube_node_status_capacity_cpu_cores)
11. Total memory requested by containers (bytes divided by 1024/1024/1024 to give GiB)
PromQL statement:
sum(kube_pod_container_resource_requests_memory_bytes)/1024/1024/1024
12. Total memory limit of containers (unit: bytes)
PromQL statement:
sum(kube_pod_container_resource_limits_memory_bytes)
13. Total node memory (unit: GiB)
PromQL statement:
sum(kube_node_status_capacity_memory_bytes/1024/1024/1024)
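The memory queries divide by 1024 three times to turn bytes into GiB, once after the sum and once inside it. A minimal Python sketch of that arithmetic, using invented per-container values, suggests that both placements give the same total:

```python
GIB = 1024 ** 3  # same as dividing by 1024 three times

# Hypothetical per-container memory requests in bytes: 512 MiB, 1 GiB, 2 GiB.
requests_bytes = [512 * 1024 ** 2, 1024 ** 3, 2 * 1024 ** 3]

# sum(metric) / 1024 / 1024 / 1024
total_then_convert = sum(requests_bytes) / GIB

# sum(metric / 1024 / 1024 / 1024)
convert_then_total = sum(b / GIB for b in requests_bytes)

print(total_then_convert)  # 3.5
print(convert_then_total)  # 3.5
```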
14. Monitoring unschedulable nodes
PromQL statement:
sum(kube_node_spec_unschedulable{node=~"$node"})
15. Pod life cycle monitoring
PromQL statement:
kube_pod_status_phase{phase=~"Pending|Running"} == 1
Description: kube_pod_status_phase can be used to count Pods by phase. In the Running phase, the Pod has been bound to a node and all of its containers have been created; at least one container is still running, or is being started or restarted. The Pod life cycle is described on the official website:
https://kubernetes.io/zh/docs/concepts/workloads/pods/pod-lifecycle/
PromQL statement:
sum(kube_pod_status_phase{namespace=~".*", phase="Pending"} == 1)
Description: Monitors the number of Pending Pods. The Pod has been accepted by the Kubernetes system, but one or more containers have not yet been created or started. This phase includes the time spent waiting for the Pod to be scheduled and the time spent downloading images over the network
PromQL statement:
sum(kube_pod_status_phase{namespace=~".*", phase="Failed"} == 1)
Description: Monitors the number of Failed Pods. All containers in the Pod have terminated, and at least one container terminated in failure; that is, the container exited with a non-zero status or was terminated by the system
PromQL statement:
sum(kube_pod_status_phase{namespace=~".*", phase="Succeeded"} == 1)
Description: Monitors the number of successfully terminated Pods. All containers in the Pod have terminated successfully and will not be restarted
PromQL statement:
sum(kube_pod_status_phase{namespace=~".*", phase="Unknown"} == 1)
Description: Monitors the number of Pods in the Unknown phase. The state of the Pod could not be obtained for some reason, usually because of a failure to communicate with the host where the Pod is located
16. Monitoring running containers in K8S
PromQL statement:
kube_pod_container_status_running{namespace=~".*"} == 1
Description: Monitors the running containers; you can use it to plot the number of running containers in the K8S cluster
17. Monitoring K8S containers waiting to be created
PromQL statement:
kube_pod_container_status_waiting{namespace=~".*"} == 1
Description: Monitors the containers that are waiting to be created in K8S
18. Monitoring stopped K8S containers
PromQL statement:
kube_pod_container_status_terminated{namespace=~".*"} == 1
Description: Monitors the containers that have stopped in K8S
19. Monitoring the number of successful Jobs
PromQL statement:
sum(kube_job_status_succeeded{namespace=~".*"})
Description: Monitors the number of successful Job executions
20. Monitoring the number of replicas of each Deployment
PromQL statement:
sum(kube_deployment_status_replicas{namespace=~".*"})
Description: kube_deployment_status_replicas is the number of replicas of each Deployment; this value is Status.Replicas
kube_deployment_spec_replicas is the number of Pods desired for the Deployment; this value is Spec.Replicas from the resource definition
kube_deployment_status_replicas_available is the number of available replicas
kube_deployment_status_replicas_updated is the number of updated replicas
kube_deployment_status_replicas_unavailable is the number of unavailable replicas
21. Cluster disk usage
PromQL statement:
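(sum(node_filesystem_size_bytes{nodename=~".*"}) - sum(node_filesystem_free_bytes{nodename=~".*"})) / sum(node_filesystem_size_bytes{nodename=~".*"})
Description: Monitors the disk usage of the K8S cluster as a ratio between 0 and 1. The node_filesystem_* metrics come from node_exporter rather than kube-state-metrics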
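The disk usage query is (total size - total free) / total size, summed over every filesystem before dividing. The Python sketch below works through that arithmetic with two invented filesystems; the field names and sizes are assumptions made for the example.

```python
def disk_usage_ratio(filesystems):
    """Aggregate usage: (sum of sizes - sum of free) / sum of sizes."""
    total_size = sum(fs["size_bytes"] for fs in filesystems)
    total_free = sum(fs["free_bytes"] for fs in filesystems)
    return (total_size - total_free) / total_size


# Hypothetical filesystems: 100 GB with 40 GB free, 300 GB with 260 GB free.
filesystems = [
    {"size_bytes": 100 * 10**9, "free_bytes": 40 * 10**9},
    {"size_bytes": 300 * 10**9, "free_bytes": 260 * 10**9},
]

print(disk_usage_ratio(filesystems))  # 0.25
```

Because the sums are taken first, a small but nearly full filesystem barely moves the cluster-wide ratio: here one filesystem is 60% used while the aggregate is only 25%.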
22. Monitoring the available space of disk volumes
PromQL statement:
kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes * 100 < 10
Note: Monitors the available space of K8S disk volumes and alerts when less than 10% is available
23. Predicting whether a disk volume will fill up within 7 days
PromQL statement:
predict_linear(kubelet_volume_stats_available_bytes[1h], 7 * 24 * 3600) < 0
Description: Predicts, from the trend over the last hour, the available space of each disk volume 7 days from now, and alerts if the predicted value is less than 0
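predict_linear fits a straight line through the samples in the range by simple linear regression and extrapolates it forward. The Python sketch below reproduces that idea on invented data; it is a simplified illustration rather than Prometheus's implementation, and it assumes the prediction starts from the timestamp of the last sample.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares line through (timestamp, value) pairs, evaluated
    seconds_ahead after the last sample (a simplifying assumption)."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + seconds_ahead)


MIB = 1024 ** 2
WEEK = 7 * 24 * 3600

# Hypothetical volume: 5 GiB free, losing 1 MiB per minute, sampled for 1h.
samples = [(60 * i, 5 * 1024 * MIB - i * MIB) for i in range(61)]

predicted = predict_linear(samples, WEEK)
print(predicted < 0)  # True
```

With this trend the volume would run out of space well before the 7 days are up, so the comparison with 0 would fire.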
24. Monitoring PV status
PromQL statement:
kube_persistentvolume_status_phase{phase=~"Failed|Pending"} > 0
- kube_persistentvolume_status_phase: PV usage status
Description: Monitors the PV status of the K8S cluster; alerts if the value is greater than 0
25. Monitoring whether a StatefulSet is down
PromQL statement:
(kube_statefulset_status_replicas_ready / kube_statefulset_status_replicas_current) != 1
Description: Monitors whether a StatefulSet in the K8S cluster is down; alerts if the ratio is not equal to 1
26. Monitoring abnormal HPA scaling
PromQL statement:
(sum(kube_hpa_status_condition{condition="ScalingLimited",status="true"}) by (hpa,namespace)) == 1
Description: Monitors abnormal HPA scaling in the K8S cluster; alerts if the value is equal to 1
27. Monitoring the number of Pod restarts in the last 5 minutes
PromQL statement:
rate(kube_pod_container_status_restarts_total[5m]) * 60 * 5 > 2
Description: Monitors the number of Pod container restarts in the K8S cluster over the last 5 minutes; alerts if it is greater than 2
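rate() returns a per-second average increase, which is why the restart query multiplies by 60 * 5 to get back to "restarts per 5 minutes". The Python sketch below shows the arithmetic on an invented counter. It is deliberately simplified: it ignores counter resets and the extrapolation that Prometheus applies at the window edges.

```python
def simple_rate(samples):
    """Per-second increase between the first and last sample.
    Simplified: no counter-reset handling, no extrapolation."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical restart counter sampled once a minute for 5 minutes.
samples = [(0, 4), (60, 4), (120, 5), (180, 6), (240, 7), (300, 7)]

restarts_in_5m = simple_rate(samples) * 60 * 5
print(round(restarts_in_5m, 6))  # 3.0
print(restarts_in_5m > 2)        # True
```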
28. Monitoring the status of ReplicaSet replicas
PromQL statement:
kube_replicaset_spec_replicas != kube_replicaset_status_ready_replicas
Note: Monitors whether the number of ready replicas of each ReplicaSet in the K8S cluster matches the desired number. Other workload mismatches can be detected in the same way:
kube_deployment_spec_replicas != kube_deployment_status_replicas_available
kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas
kube_deployment_status_observed_generation != kube_deployment_metadata_generation
kube_statefulset_status_observed_generation != kube_statefulset_metadata_generation
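These comparisons work because a PromQL comparison between two metrics pairs up series with identical labels and keeps the left-hand sample wherever the comparison holds. The Python sketch below imitates that matching with dictionaries keyed by a (namespace, replicaset) tuple; the workload names and values are invented for the example.

```python
def mismatched(left, right):
    """Imitates `left != right`: pair series with identical labels and
    keep the left-hand value where the two values differ."""
    return {labels: value for labels, value in left.items()
            if labels in right and right[labels] != value}


# Hypothetical series, keyed by (namespace, replicaset).
spec_replicas = {("default", "web-5d9c"): 3, ("default", "api-7f6b"): 2}
ready_replicas = {("default", "web-5d9c"): 3, ("default", "api-7f6b"): 1}

print(mismatched(spec_replicas, ready_replicas))
# {('default', 'api-7f6b'): 2}
```

Only the ReplicaSet whose ready count differs from its desired count is returned, so an empty result means every workload is at its desired size.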