K8S system monitoring: cluster resource monitoring with kube-state-metrics

kube-state-metrics metric data

1. kube-state-metrics description

kube-state-metrics focuses on the current state of k8s resources such as Deployments and DaemonSets. It is kept separate from metrics-server because the two have different goals. metrics-server only collects and serves resource usage data that already exists, such as CPU and memory. kube-state-metrics keeps a snapshot of the k8s object state in memory and generates new metrics from it. It does not store or forward those metrics itself; it exposes them over HTTP so that a monitoring system such as Prometheus can scrape them.
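Prometheus reads these metrics by scraping the text-format /metrics endpoint of kube-state-metrics. The following is a minimal Python sketch of what that text looks like and how it breaks down into a metric name, labels, and a value. The sample text, the label values, and the parse_metrics helper are assumptions made for illustration; the parser covers only a simplified subset of the format (no timestamps, no special escaping).

```python
import re

# Hypothetical sample of the text kube-state-metrics might serve on /metrics.
SAMPLE = """\
# TYPE kube_node_info gauge
kube_node_info{node="node-a",kernel_version="5.4.0"} 1
kube_node_info{node="node-b",kernel_version="5.4.0"} 1
"""

# metric_name{label="value",...} value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines outside the simplified subset
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples
```

Each parsed tuple corresponds to one time series that a PromQL selector such as kube_node_info{node="node-a"} can match.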

2. Monitoring the number of nodes

PromQL statement:

kube_node_info{instance="10.42.4.65:8080"}

Note: This query returns one series per node in the K8S cluster; wrap it in count() to get the number of nodes as a single value. You can alert when that number differs from the expected node count, or show it on a dashboard.
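Any of the PromQL statements in this article can be sent to Prometheus through its HTTP API instant-query endpoint, /api/v1/query. The sketch below only builds the request URL; the base address http://localhost:9090 and the helper name instant_query_url are assumptions, so adjust them for your environment.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for a Prometheus instant query (GET /api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Assumed Prometheus address; count() collapses the per-node series into one value.
url = instant_query_url("http://localhost:9090", "count(kube_node_info)")
```

The resulting URL can be fetched with any HTTP client; the response is JSON whose data.result field holds the matching series.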

3. Cluster node status is abnormal

PromQL statement:

kube_node_status_condition{condition="Ready",status!="true"}==1

Description: Monitors whether a cluster node is in an abnormal state. A result with value 1 means the node's Ready condition is false or unknown, and an alert can be raised.

4. Is the cluster node status ready?

PromQL statement:

kube_node_status_condition{condition="Ready",status="true"} == 0

Description: Monitors whether each cluster node is Ready, similar to the STATUS column shown by kubectl get node. The query returns the nodes whose Ready condition is not true.
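kube_node_status_condition emits one series per node, condition, and status value, and the series for the current status has the value 1 while the others are 0. The sketch below applies the same filter as the query above to a small, made-up set of samples; the node names and the helper name not_ready_nodes are hypothetical.

```python
def not_ready_nodes(samples):
    """Return nodes whose Ready/status="true" series has the value 0.

    samples: iterable of (labels, value) pairs for kube_node_status_condition.
    """
    return sorted(
        labels["node"]
        for labels, value in samples
        if labels["condition"] == "Ready"
        and labels["status"] == "true"
        and value == 0
    )


# Hypothetical samples: node-a is Ready, node-b is not.
SAMPLES = [
    ({"node": "node-a", "condition": "Ready", "status": "true"}, 1),
    ({"node": "node-a", "condition": "Ready", "status": "false"}, 0),
    ({"node": "node-b", "condition": "Ready", "status": "true"}, 0),
    ({"node": "node-b", "condition": "Ready", "status": "false"}, 1),
]
```

With these samples, not_ready_nodes(SAMPLES) selects only node-b, which is the node the alert would fire for.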

4.1. Is there a shortage of cluster node memory or disk resources?

PromQL statement:

kube_node_status_condition{condition=~"OutOfDisk|MemoryPressure|DiskPressure",status!="false"}==1

5. Monitoring failed PVCs in the cluster

PromQL statement:

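kube_persistentvolumeclaim_status_phase{phase="Lost"}==1

Note: A PVC phase is one of Pending, Bound, or Lost. PVCs have no Failed phase, so the query matches Lost.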

6. Monitoring Pods that failed to start in the cluster

PromQL statement:

kube_pod_status_phase{phase=~"Failed|Unknown"}==1

7. Pod container restart monitoring in the last 30 minutes

PromQL statement:

changes(kube_pod_container_status_restarts_total[30m])
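Note that changes() counts how many times the sample value changed inside the window, not by how much it grew. The sketch below reproduces that counting rule on a plain list of samples; the sample values are made up and the function is a simplified stand-in, not Prometheus code.

```python
def changes(values):
    """Count how many consecutive sample pairs differ, like PromQL changes()."""
    return sum(1 for prev, cur in zip(values, values[1:]) if cur != prev)


# Hypothetical restart counter scraped five times within 30 minutes.
RESTARTS = [3, 3, 4, 4, 6]
```

Here the counter rose by 3 restarts, but changes(RESTARTS) is 2 because two restarts happened between the same pair of scrapes. For alerting on "any restart at all" the difference does not matter.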

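8. Total CPU requested by containers (unit: cores)

PromQL statement:

sum(kube_pod_container_resource_requests_cpu_cores)

Description: Monitors the total number of CPU cores requested by all containers. In kube-state-metrics v2.0 and later, this metric is exposed as kube_pod_container_resource_requests{resource="cpu"}.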

9. Total CPU limits of containers (unit: cores)

PromQL statement:

sum(kube_pod_container_resource_limits_cpu_cores)

10. Total node CPU capacity (unit: cores)

PromQL statement:

sum(kube_node_status_capacity_cpu_cores)

11. Total memory requested by containers (bytes divided by 1024/1024/1024 to give GiB)

PromQL statement:

sum(kube_pod_container_resource_requests_memory_bytes)/1024/1024/1024

12. Total memory limits of containers (bytes)

PromQL statement:

sum(kube_pod_container_resource_limits_memory_bytes)

13. Total node memory

PromQL statement:

sum(kube_node_status_capacity_memory_bytes)/1024/1024/1024
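Dividing a byte count by 1024 three times yields GiB, and dividing the requested total by the capacity total gives the fraction of cluster memory that is already requested. A minimal sketch of that arithmetic follows; the helper names and the example figures are assumptions.

```python
BYTES_PER_GIB = 1024 ** 3  # same as dividing by 1024/1024/1024 in PromQL


def bytes_to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / BYTES_PER_GIB


def requested_fraction(requested_bytes, capacity_bytes):
    """Fraction of total capacity that container requests already claim."""
    if capacity_bytes <= 0:
        raise ValueError("capacity must be positive")
    return requested_bytes / capacity_bytes


# Hypothetical totals: 4 GiB requested on a cluster with 16 GiB of memory.
requested = 4 * BYTES_PER_GIB
capacity = 16 * BYTES_PER_GIB
```

In PromQL the same ratio is the requests query divided by the capacity query, both taken before the GiB conversion.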

14. Unschedulable node monitoring

PromQL statement:

sum(kube_node_spec_unschedulable{node=~"$node"})

15. Pod life cycle monitoring

PromQL statement:

kube_pod_status_phase{phase=~"Pending|Running"} == 1

Description: kube_pod_status_phase can be used to count Pods by phase. In the Running phase, the Pod has been bound to a node and all of its containers have been created; at least one container is still running, or is being started or restarted. The Pod lifecycle is described in the official documentation:
https://kubernetes.io/zh/docs/concepts/workloads/pods/pod-lifecycle/
PromQL statement:

sum(kube_pod_status_phase{namespace=~".*", phase="Pending"}==1)

Description: Monitor the number of Pending Pods. The Pod has been accepted by the Kubernetes system, but one or more containers have not been created or started. This phase includes the time spent waiting for the Pod to be scheduled and the time spent downloading images over the network.
PromQL statement:

sum(kube_pod_status_phase{namespace=~".*", phase="Failed"}==1)

Description: Monitor the number of Failed Pods. All containers in the Pod have terminated, and at least one container terminated in failure; that is, the container exited with a non-zero status or was terminated by the system.
PromQL statement:

sum(kube_pod_status_phase{namespace=~".*", phase="Succeeded"}==1)

Description: Monitor the number of successfully terminated Pods. All containers in the Pod have terminated successfully and will not be restarted.
PromQL statement:

sum(kube_pod_status_phase{namespace=~".*", phase="Unknown"}==1)

Description: Monitor the number of Pods in the Unknown phase. The Pod status cannot be obtained for some reason, usually because communication with the host where the Pod runs has failed.
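The four queries above all follow the same pattern: keep the series whose value is 1 and count them per phase. The sketch below performs that aggregation on a small, made-up set of kube_pod_status_phase samples; the Pod names and the helper name pods_per_phase are hypothetical.

```python
from collections import Counter


def pods_per_phase(samples):
    """Count Pods per phase, keeping only series whose value is 1.

    samples: iterable of (labels, value) pairs for kube_pod_status_phase.
    """
    counts = Counter()
    for labels, value in samples:
        if value == 1:  # the series for the Pod's current phase
            counts[labels["phase"]] += 1
    return dict(counts)


# Hypothetical samples: two Running Pods and one Pending Pod.
SAMPLES = [
    ({"pod": "web-1", "phase": "Running"}, 1),
    ({"pod": "web-1", "phase": "Pending"}, 0),
    ({"pod": "web-2", "phase": "Running"}, 1),
    ({"pod": "job-1", "phase": "Pending"}, 1),
    ({"pod": "job-1", "phase": "Running"}, 0),
]
```

In PromQL, a single query such as sum by (phase) (kube_pod_status_phase == 1) would produce the same per-phase breakdown.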

16. Monitor running containers in K8S

PromQL statement:

kube_pod_container_status_running{namespace=~".*"}==1

Description: Monitor the running containers; you can use this to plot the number of running containers in the K8S cluster

17. Monitor K8S containers waiting to be created

PromQL statement:

kube_pod_container_status_waiting{namespace=~".*"}==1

Description: Monitor the containers that are in the waiting state in the K8S cluster, that is, containers that have not started yet

18. Monitor stopped K8S containers

PromQL statement:

kube_pod_container_status_terminated{namespace=~".*"}==1

Description: Monitor the containers that are in the terminated state in the K8S cluster

19. Monitor the number of successful Jobs

PromQL statement:

sum(kube_job_status_succeeded{namespace=~".*"})

Description: Monitor the number of Job executions that completed successfully

20. Monitor the number of replicas of each Deployment

PromQL statement:

sum(kube_deployment_status_replicas{namespace=~".*"})

Description: The query sums the replicas of all matching Deployments. Related metrics:
kube_deployment_status_replicas: the number of replicas of each Deployment (Status.Replicas)
kube_deployment_spec_replicas: the number of Pods desired by the Deployment (Spec.Replicas in the resource definition)
kube_deployment_status_replicas_available: the number of available replicas
kube_deployment_status_replicas_updated: the number of updated replicas
kube_deployment_status_replicas_unavailable: the number of unavailable replicas

21. Cluster disk usage

PromQL statement:

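(sum(node_filesystem_size_bytes{nodename=~".*"}) - sum(node_filesystem_free_bytes{nodename=~".*"})) / sum(node_filesystem_size_bytes{nodename=~".*"})

Description: Monitor the disk usage of the K8S cluster as a ratio between 0 and 1. The node_filesystem_* metrics come from node_exporter rather than kube-state-metrics.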
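The disk usage expression is (total size - total free) / total size, summed over all filesystems. A minimal sketch of that formula follows; the function name and the byte figures are made up for illustration.

```python
def disk_usage_ratio(size_bytes, free_bytes):
    """Used fraction across filesystems: (sum(size) - sum(free)) / sum(size)."""
    total_size = sum(size_bytes)
    total_free = sum(free_bytes)
    if total_size <= 0:
        raise ValueError("total size must be positive")
    return (total_size - total_free) / total_size


# Hypothetical filesystems: 100 and 300 bytes in size, 50 and 150 bytes free.
ratio = disk_usage_ratio([100, 300], [50, 150])
```

Multiply the ratio by 100 to show a percentage on a dashboard.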

22. Monitor the available space of volumes in the cluster

PromQL statement:

kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes * 100 < 10

Note: Monitors the available space of each volume in the K8S cluster and alerts when less than 10% of the capacity is available

23. Predict whether a volume will be full within 7 days

PromQL statement:

predict_linear(kubelet_volume_stats_available_bytes[1h], 7 * 24 * 3600) < 0

Description: Uses the trend of the last hour to predict the available space of each volume 7 days from now, and alerts if the predicted value is less than 0, meaning the volume is expected to fill up
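predict_linear() fits a straight line to the samples in the range by simple linear regression and extrapolates it forward. The sketch below shows the idea with an ordinary least-squares fit. It is a simplification: it extrapolates from the last sample's timestamp, whereas Prometheus extrapolates from the query evaluation time, and the sample data is made up.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares line through (timestamp, value) samples, evaluated
    seconds_ahead after the last sample's timestamp."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must have distinct timestamps")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t                      # change in value per second
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


# Hypothetical volume losing 100 bytes of free space every 10 minutes.
SAMPLES = [(0, 1000.0), (600, 900.0), (1200, 800.0)]
predicted = predict_linear(SAMPLES, 7 * 24 * 3600)
```

Because the predicted free space is negative, the alert condition < 0 would be true for this volume.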

24. PV status monitoring

PromQL statement:

kube_persistentvolume_status_phase{phase=~"Failed|Pending"} > 0

kube_persistentvolume_status_phase: the phase (status) of each PV

Description: Monitor the status of PVs in the K8S cluster and alert if any PV is in the Failed or Pending phase (value greater than 0)

25. Monitor whether a StatefulSet is down

PromQL statement:

(kube_statefulset_status_replicas_ready / kube_statefulset_status_replicas_current) != 1

Description: Monitor whether a StatefulSet in the K8S cluster is down, and alert if the ratio of ready replicas to current replicas is not equal to 1

26. Monitor abnormal HPA scaling

PromQL statement:

(sum(kube_hpa_status_condition{condition="ScalingLimited",status="true"}) by (hpa,namespace)) == 1

Description: Monitor abnormal HPA scaling in the K8S cluster. The ScalingLimited condition is true when the desired replica count has been capped by the HPA's minimum or maximum; alert if the value equals 1.

27. Monitor the number of Pod restarts in the last 5 minutes

PromQL statement:

rate(kube_pod_container_status_restarts_total[5m]) * 60 * 5 > 2

Description: Monitor the number of container restarts in the last 5 minutes. rate() returns restarts per second, so multiplying by 60 * 5 gives the restarts over 5 minutes; alert if it is greater than 2
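Multiplying the per-second rate by the window length in seconds gives the increase of the counter over that window. The sketch below computes such an increase from raw counter samples, including the rule that a drop in value is treated as a counter reset. It is a simplified stand-in: Prometheus additionally extrapolates to the window boundaries, which is omitted here, and the sample values are made up.

```python
def simple_increase(values):
    """Total growth of a counter across consecutive samples.

    A sample lower than its predecessor is treated as a counter reset,
    so growth after the reset is counted from zero.
    """
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


# Hypothetical restart counter samples inside a 5-minute window.
RESTARTS = [3, 3, 4, 4, 6]
# The counter resets to 0 when the value drops from 6 to 0.
RESTARTS_WITH_RESET = [5, 6, 0, 1]
```

For RESTARTS the increase is 3, so a threshold of > 2 would fire.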

28. Monitor the replica status of ReplicaSets

PromQL statement:

kube_replicaset_spec_replicas != kube_replicaset_status_ready_replicas

Note: This detects ReplicaSets in the K8S cluster whose ready replica count differs from the desired count. The same pattern detects mismatches for other workloads:

kube_deployment_spec_replicas != kube_deployment_status_replicas_available

kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas

kube_deployment_status_observed_generation != kube_deployment_metadata_generation

kube_statefulset_status_observed_generation != kube_statefulset_metadata_generation
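All of these comparisons work the same way: PromQL pairs up series with identical labels on both sides and keeps the pairs whose values differ. The sketch below mimics that for desired versus ready replica counts; the workload names and the helper name mismatched are hypothetical.

```python
def mismatched(spec, ready):
    """Return workloads whose desired and ready replica counts differ.

    spec, ready: dicts mapping (namespace, name) to a replica count.
    Like PromQL one-to-one matching, keys missing on either side are skipped.
    """
    return sorted(
        key for key, want in spec.items()
        if key in ready and ready[key] != want
    )


# Hypothetical Deployments: "api" wants 3 replicas but only 2 are ready.
SPEC = {("default", "web"): 2, ("default", "api"): 3, ("default", "new"): 1}
READY = {("default", "web"): 2, ("default", "api"): 2}
```

The ("default", "new") entry has no ready series, so it is not reported; a separate check is needed to catch workloads that are missing a metric entirely.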