Monitor key metrics for Kubernetes Node components

All Kubernetes components expose monitoring data through the /metrics endpoint, and Kube-Proxy is no exception. With the ss or netstat command you can see the ports Kube-Proxy listens on: 10249 exposes the monitoring metrics, and 10256 serves as the health check port. Generally we only care about the former.

1. Kube-Proxy key metrics

1. General Go program metrics

Metrics in this category, such as go_goroutines, go_gc_duration_seconds, process_cpu_seconds_total, and process_resident_memory_bytes, are present in any program instrumented with the Prometheus Go SDK, and that includes Kube-Proxy, Kubelet, APIServer, Scheduler, and so on.
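
For example, a few simple queries over these runtime metrics (a minimal sketch; the job label value here is an assumption that depends on your scrape configuration):

# Goroutine count of each Kube-Proxy instance
go_goroutines{job="kube-proxy"}

# Resident memory of the process
process_resident_memory_bytes{job="kube-proxy"}

# Per-second CPU usage of the process itself
rate(process_cpu_seconds_total{job="kube-proxy"}[5m])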

2. APIServer request metrics

Many components in Kubernetes have to call the APIServer. How many calls are made per second, how many succeed or fail, and how long they take are all critical metrics. For example:

  • rest_client_request_duration_seconds: latency of requests made to the APIServer
  • rest_client_requests_total: count of requests made to the APIServer
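
As a sketch of how these might be used (assuming the usual client-go label names such as code and verb, which can differ between Kubernetes versions):

# Per-second rate of failed APIServer requests (non-2xx status codes)
sum(rate(rest_client_requests_total{code!~"2.."}[5m])) by (code)

# 99th percentile latency of requests to the APIServer, by verb
histogram_quantile(0.99,
  sum(rate(rest_client_request_duration_seconds_bucket[5m])) by (verb, le)
)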

3. Rule synchronization metrics

The core function of Kube-Proxy is to fetch forwarding rules from the APIServer and apply them to the local iptables or ipvs configuration, so the metrics related to this rule synchronization are very important.
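
Kube-Proxy exposes metrics such as kubeproxy_sync_proxy_rules_duration_seconds and kubeproxy_sync_proxy_rules_last_timestamp_seconds for this. A minimal sketch of how they might be queried (metric names can vary across versions):

# 99th percentile time spent syncing iptables/ipvs rules
histogram_quantile(0.99,
  sum(rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m])) by (le)
)

# Seconds since the last successful rule sync; large values suggest staleness
time() - kubeproxy_sync_proxy_rules_last_timestamp_seconds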

2. Kubelet key metrics

Kubelet also exposes the general Go process metrics and APIServer request metrics described above, just like Kube-Proxy. Its core function is managing Pods: operating the various CNI and CSI related interfaces and dealing with the container engine. Metrics that measure these operations are particularly critical.
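
For example, Kubelet exposes metrics such as kubelet_runtime_operations_total, kubelet_runtime_operations_errors_total and kubelet_pod_start_duration_seconds for these operations. A sketch of possible queries (label names depend on the Kubernetes version):

# Error ratio of container runtime operations, by operation type
sum(rate(kubelet_runtime_operations_errors_total[5m])) by (operation_type)
/
sum(rate(kubelet_runtime_operations_total[5m])) by (operation_type)

# 99th percentile Pod startup latency
histogram_quantile(0.99,
  sum(rate(kubelet_pod_start_duration_seconds_bucket[5m])) by (le)
)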

3. Container load metrics

CPU metrics

sum(
  irate(container_cpu_usage_seconds_total[3m])
) by (pod, id, namespace, container, ident, image)
/
sum(
  container_spec_cpu_quota / container_spec_cpu_period
) by (pod, id, namespace, container, ident, image)

This computes CPU usage as a division: the numerator is the CPU time the container actually consumes per second, and the denominator is the CPU time allocated to it per second (its CFS quota divided by the CFS period).

increase(container_cpu_cfs_throttled_periods_total[1m])
/
increase(container_cpu_cfs_periods_total[1m]) * 100

This calculates the proportion of time during which the container's CPU was throttled. If the value is high, the container is frequently hitting its limit when it tries to use CPU, and its CPU quota should be increased. Latency-sensitive applications need to pay special attention to this metric.
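
As a usage sketch, the same ratio can serve as an alert condition; the 25% threshold below is only an illustrative number, not a recommendation from the course:

# Flag containers throttled in more than 25% of CFS periods (hypothetical threshold)
increase(container_cpu_cfs_throttled_periods_total[1m])
/
increase(container_cpu_cfs_periods_total[1m]) * 100 > 25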

Memory metrics

container_memory_working_set_bytes
/
container_spec_memory_limit_bytes
and
container_spec_memory_limit_bytes != 0

Calculating memory usage is again a division at its core: the numerator is the container's working-set memory usage and the denominator is its memory limit. Some containers do not specify a memory limit, so the and clause is needed as a restriction: the division is only meaningful where limit_bytes is not equal to 0.

Pod network traffic

irate(container_network_transmit_bytes_total[1m]) * 8
irate(container_network_receive_bytes_total[1m]) * 8

The metric names are self-explanatory: transmit is the outbound direction and receive is the inbound direction. Both are monotonically increasing Counter values, so irate is used to compute a per-second rate. Since network traffic is generally measured in bits, the result is multiplied by 8 at the end to convert bytes into bits.
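
Note that these cAdvisor metrics usually carry an interface label, so per-Pod totals need an extra aggregation; a minimal sketch:

# Total outbound traffic per Pod, in bits per second, summed across interfaces
sum(irate(container_network_transmit_bytes_total[1m]) * 8) by (namespace, pod)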

Pod disk I/O read/write traffic

irate(container_fs_reads_bytes_total[1m])
irate(container_fs_writes_bytes_total[1m])

From the names (and the _total suffix) you can tell these are Counter metrics. We do not care about the raw cumulative value, only the per-second rate over the most recent window, so irate is used here as well.

This article is a study note for Day 10 of August. The content comes from the Geek Time course "Operation and Maintenance Monitoring System Practical Notes". The course is recommended.

Origin blog.csdn.net/key_3_feng/article/details/132219468