Cloud Native In-Depth Analysis: How kubectl top Performs Resource Monitoring in Kubernetes

1. Use of Kubectl top

  • kubectl top is a basic command, but a supporting component must be deployed before it can return monitoring values:
    • Below 1.8: deploy heapster;
    • 1.8 or above: deploy metrics-server;
  • kubectl top node: Check the usage of node:

(image: kubectl top node output)

  • kubectl top pod: Check the usage of pod:

(image: kubectl top pod output)

  • If no pod name is specified, all pods in the namespace are displayed, and --containers shows every container in the pod:

(image: kubectl top pod --containers output)

  • Indicator meaning (an illustrative example follows this list):
    • The units are consistent with request and limit in k8s: for CPU, 100m = 0.1 core; for memory, 1Mi = 1024Ki;
    • A pod's memory value is its actual usage and is also the basis for the OOM decision when a limit is set. A pod's usage equals the sum of its business containers, excluding the pause container, and corresponds to the container_memory_working_set_bytes metric in cAdvisor;
    • A node's value is neither the sum of all pod values on that node, nor the value seen by running top or free directly on the machine.
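  • For reference, a minimal illustration of what the output looks like (the node and pod names and all values below are made up, not taken from a real cluster):

$ kubectl top node
NAME      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1    250m         12%    1540Mi          40%

$ kubectl top pod -n kube-system --containers
POD                      NAME             CPU(cores)   MEMORY(bytes)
metrics-server-xxxxx     metrics-server   4m           18Mi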

2. Implementation principle

① Data link

  • kubectl top, the k8s dashboard, and scheduling components such as HPA all consume the same data, and the data link is as follows:

(image: monitoring data link diagram)

  • When using heapster: the apiserver forwards the metric request, via proxy, to the heapster service in the cluster:

(image: apiserver log showing the proxied heapster request)

  • When using metrics-server: the apiserver accesses metrics through the /apis/metrics.k8s.io/ path:

(image: apiserver log showing the /apis/metrics.k8s.io request)

  • Compare this with the log of kubectl get pod:

(image: apiserver log for kubectl get pod, for comparison)
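  • The Metrics API can also be queried directly through the apiserver, which returns the same data kubectl top consumes (the jq pipe is optional and assumes jq is installed):

kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" | jq .
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/kube-system/pods" | jq .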

② Metric API

  • It can be seen that heapster uses proxy forwarding, while both metrics-server and ordinary pods go through a regular api/xx resource interface. The proxy approach used by heapster is problematic:
    • proxy is only a proxied request, generally used for troubleshooting; it is not stable enough and the version is not controllable;
    • heapster's interface cannot offer the same complete authentication and client integration as the apiserver, and maintaining both sides (for example, a generic apiserver) would be expensive;
    • A pod's monitoring data is a core metric (used by HPA scheduling) and should have the same status as the pod itself; that is, metrics should exist as a resource, for example under metrics.k8s.io, known as the Metrics API;
  • Kubernetes has gradually deprecated heapster since version 1.8 and proposed the Metrics API concept described above; metrics-server is the official implementation of that concept, obtaining metrics from the kubelet and replacing heapster.

③ kube-aggregator

  • With the metrics-server component, the required data is collected and an interface is exposed, but so far this is no different from heapster. The most critical step is how to make the apiserver forward requests for /apis/metrics.k8s.io to the metrics-server component. The solution is kube-aggregator.
  • kube-aggregator is a powerful extension of the apiserver that allows k8s developers to write their own service and register it in the k8s API, that is, to extend the API. In fact, metrics-server was already complete in version 1.7 but had to wait for kube-aggregator to appear. kube-aggregator is implemented inside the apiserver; in some k8s versions it is not enabled by default and must be turned on with the corresponding configuration (see the flag sketch below). Its core functions are dynamic registration, discovery aggregation, and secure proxying.

(image: kube-aggregator architecture)
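  • As a rough sketch, the aggregation layer is typically enabled on kube-apiserver with flags like the following (the flag names are standard kube-apiserver options; the certificate paths shown are kubeadm defaults and will differ in other setups):

--requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
--requestheader-allowed-names=front-proxy-client
--requestheader-extra-headers-prefix=X-Remote-Extra-
--requestheader-group-headers=X-Remote-Group
--requestheader-username-headers=X-Remote-User
--proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.crt
--proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client.key
--enable-aggregator-routing=true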

  • For example, when metrics-server registers the pod and node metrics:

(image: metrics-server APIService registration for pods and nodes)
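  • The registration itself is just an APIService object pointing metrics.k8s.io at the metrics-server Service; it can be inspected with kubectl (output abridged; the fields below follow the standard upstream metrics-server manifest):

$ kubectl get apiservice v1beta1.metrics.k8s.io -o yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  version: v1beta1
  groupPriorityMinimum: 100
  versionPriority: 100
  service:
    name: metrics-server
    namespace: kube-system
  insecureSkipTLSVerify: true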

④ Monitoring system

  • When proposing the Metrics API concept, the community also proposed a new monitoring architecture, in which monitoring resources are divided into two kinds:
    • Core metrics: collected from the kubelet, cAdvisor, etc., and served by metrics-server to the dashboard, the HPA controller, and so on;
    • Custom metrics: the custom.metrics.k8s.io API is provided by the Prometheus adapter and can support any metric collected by Prometheus.

(image: monitoring architecture with core and custom metrics)

  • The core metrics only cover the CPU and memory of nodes and pods. In general, core metrics are enough for HPA, but to scale on custom metrics, such as request QPS or the number of 5xx errors, you need custom metrics. In Kubernetes these are usually provided by Prometheus and then aggregated into the apiserver by k8s-prometheus-adapter, achieving the same effect as core metrics (a minimal HPA sketch follows below).
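  • A minimal sketch of scaling on a custom metric (the deployment name web and the metric name http_requests_per_second are hypothetical and must match a metric actually exposed by the adapter through custom.metrics.k8s.io; the API version is autoscaling/v2beta2, or autoscaling/v2 on newer clusters):

# check which custom metrics the adapter exposes (jq optional)
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq '.resources[].name'

kubectl apply -f - <<'EOF'
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
EOF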

⑤ Kubelet

  • As mentioned earlier, both heapster and metrics-server only relay and aggregate data; both call the kubelet API to obtain it. Inside the kubelet, the actual metric collection is done by the cAdvisor module, and the data can be fetched on the node via port 10255 (port 10250 after version 1.11), as shown in the curl examples below:
    • Kubelet summary API: 127.0.0.1:10255/stats/summary, which exposes aggregated node and pod data;
    • cAdvisor metrics: 127.0.0.1:10255/metrics/cadvisor, which exposes container-level data;
  • For example, the memory usage of a container is shown below:

(image: container memory usage returned by the kubelet)
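  • On a node, the endpoints can be queried directly, for example (assuming the read-only port 10255 is still enabled; on 10250 you need a bearer token and TLS options, and the jq/grep filters are just for readability):

curl -s http://127.0.0.1:10255/stats/summary | jq '.node.memory'
curl -s http://127.0.0.1:10255/metrics/cadvisor | grep container_memory_working_set_bytes | head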

  • Although the kubelet provides the metric interfaces, the actual monitoring logic is handled by the built-in cAdvisor module. The evolution is as follows:
    • Since k8s 1.6, cAdvisor has been integrated into the kubelet and no longer needs separate configuration;
    • Since k8s 1.7, the kubelet metrics API no longer includes cAdvisor metrics; instead an independent summary API is provided;
    • Since k8s 1.12, the standalone cAdvisor listening port has been removed from k8s, and all monitoring data is served through the kubelet's API.

⑥ cAdvisor

  • cAdvisor is open sourced by Google and written in Go. It not only collects information about all running containers on a machine, including CPU usage, memory usage, network throughput, and filesystem usage, but also provides a basic query interface and an HTTP interface, making it easy for other components to fetch data.
  • In k8s it is integrated into the kubelet and started by default, making it the de facto standard in k8s. An example of the data structure returned by cAdvisor:

(image: example of the data structure returned by cAdvisor)

  • The core logic is to create a manager instance from newly created memoryStorage and sysfs instances; the manager interface then defines many functions for obtaining container and machine information:

(image: cAdvisor manager interface definition)

  • Interpreting cAdvisor's metrics: see the cgroup-v1 documentation (https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt). When cAdvisor collects metrics it actually calls the runc/libcontainer library, and libcontainer is a wrapper around the cgroup files; in other words, cAdvisor is only a forwarder, and its data ultimately comes from the cgroup files.

⑦ cgroup

  • The values in the cgroup files are the ultimate source of the monitoring data, for example (a quick command-line check follows below):
    • The mem usage value comes from /sys/fs/cgroup/memory/docker/[containerId]/memory.usage_in_bytes;
    • If no memory limit is set, limit = machine_mem; otherwise it comes from /sys/fs/cgroup/memory/docker/[id]/memory.limit_in_bytes;
    • memory usage percentage = memory.usage_in_bytes / memory.limit_in_bytes.
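  • A quick check on a node with the docker cgroupfs layout (the paths assume cgroup v1 and the docker cgroup parent; with the systemd driver or cgroup v2 the directory names differ):

CID=$(docker ps -q --no-trunc | head -n 1)   # full id matches the cgroup directory name
cat /sys/fs/cgroup/memory/docker/$CID/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/docker/$CID/memory.limit_in_bytes
# usage percentage = usage_in_bytes / limit_in_bytes
awk -v u=$(cat /sys/fs/cgroup/memory/docker/$CID/memory.usage_in_bytes) \
    -v l=$(cat /sys/fs/cgroup/memory/docker/$CID/memory.limit_in_bytes) \
    'BEGIN { printf "%.2f%%\n", u * 100 / l }'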
  • In general, the cgroup directory covers CPU, memory, disk, network, and other subsystems:
devices: device access control.
cpuset: assigns specific CPUs and memory nodes.
cpu: controls CPU usage.
cpuacct: accounts for CPU usage.
memory: limits the upper bound of memory usage.
freezer: freezes (pauses) the processes in a cgroup.
net_cls: limits network bandwidth together with tc (traffic controller).
net_prio: sets the network traffic priority of processes.
hugetlb: limits HugeTLB usage.
perf_event: allows the perf tool to do performance monitoring per cgroup.
  • The meanings of several commonly used files under memory:
memory.usage_in_bytes       memory in use, including cache and buffer (bytes), roughly Linux used_mem
memory.limit_in_bytes       total memory limit (bytes), roughly Linux total_mem
memory.failcnt              count of failed memory allocation attempts
memory.memsw.usage_in_bytes memory plus swap in use (bytes)
memory.memsw.limit_in_bytes memory plus swap limit (bytes)
memory.memsw.failcnt        count of failed memory+swap allocation attempts
memory.stat                 memory-related statistics
  • The information in memory.stat is the most complete:

(image: contents of memory.stat)

3. Problem analysis

① Why does kubectl top report an error?

  • In general, kubectl top errors fall into the following cases; you can see the exact calls with kubectl top pod -v=10:
    • heapster or metrics-server is not deployed, or its pod is not running properly; check the corresponding pod logs;
    • The pod being queried was just created and no metrics have been collected yet, so a not found error is returned; the default collection interval is 1 minute.
  • If neither applies, check whether port 10255 on the kubelet is open; by default this read-only port is used to obtain metrics. You can also add a certificate to the heapster or metrics-server configuration and switch to the authenticated port 10250 (typical checks are shown below).
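  • Typical checks, assuming metrics-server was deployed from the upstream manifest (the k8s-app=metrics-server label comes from that manifest; adjust it if your deployment uses different labels):

kubectl top pod -v=10                                   # print the full request/response flow
kubectl -n kube-system get pods -l k8s-app=metrics-server
kubectl -n kube-system logs -l k8s-app=metrics-server --tail=50
kubectl get apiservice v1beta1.metrics.k8s.io           # the Available condition should be True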

② How is kubectl top pod memory calculated, and does it include the pause container?

  • Every pod starts a pause container, and being a container it consumes some resources (usually 2-3 MiB of memory). In the cgroup hierarchy the business containers and the pause container live under the same pod directory.
  • However, when cAdvisor queries a pod's memory usage, it first obtains the pod's container list and then fetches the memory usage of each container in turn, and this list does not include pause, so the result of top pod does not include the pause container. Moreover, the memory usage reported by kubectl top pod is not container_memory_usage_bytes in cAdvisor but container_memory_working_set_bytes, calculated as:
    • container_memory_usage_bytes = container_memory_rss + container_memory_cache + kernel memory
    • container_memory_working_set_bytes = container_memory_usage_bytes - total_inactive_file (inactive file cache pages).
  • container_memory_working_set_bytes is the amount of memory the container actually uses and is also the basis for the OOM decision when a limit is set. container_memory_usage_bytes in cAdvisor corresponds to the memory.usage_in_bytes file in the cgroup, but there is no single file for container_memory_working_set_bytes; its calculation lives in cAdvisor's code, as follows:

(image: working-set calculation logic in the cAdvisor code)
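  • The same number can be reproduced straight from the cgroup files, which is a handy sanity check (cgroup v1 with the docker cgroupfs layout assumed; cAdvisor additionally clamps the result at zero, as done here):

CID=$(docker ps -q --no-trunc | head -n 1)
USAGE=$(cat /sys/fs/cgroup/memory/docker/$CID/memory.usage_in_bytes)
INACTIVE_FILE=$(awk '/^total_inactive_file/ {print $2}' /sys/fs/cgroup/memory/docker/$CID/memory.stat)
echo $(( USAGE > INACTIVE_FILE ? USAGE - INACTIVE_FILE : 0 ))   # ~ container_memory_working_set_bytes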

  • Similarly, a node's memory usage is also based on container_memory_working_set_bytes.

③ How is kubectl top node calculated, and what is the difference from direct top on the node?

  • The CPU and memory values returned by kubectl top node are not the sum of all pods on the node, so do not simply add pods up; top node is the aggregated statistic of the cgroup root directory on the machine:

(image: cgroup root-directory statistics on the node)

  • The values shown by running top directly on the machine cannot be compared one-to-one with kubectl top node, because the calculation logic differs; for memory, the approximate correspondence is (the left side is from top on the machine, the right side from kubectl top):
rss + cache = (in)active_anon + (in)active_file

(image: comparison of top on the node and kubectl top node)
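  • The node value can likewise be approximated from the root memory cgroup, which makes the gap with free/top easy to see on a real machine (cgroup v1 paths assumed):

USAGE=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
INACTIVE_FILE=$(awk '/^total_inactive_file/ {print $2}' /sys/fs/cgroup/memory/memory.stat)
echo "root cgroup working set: $(( USAGE - INACTIVE_FILE )) bytes"
free -b | awk 'NR==2 {print "free(1) used:            " $3 " bytes"}'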

④ Why do kubectl top pod and top run inside the pod (via exec) differ?

  • The difference with the top command is the same as above; they cannot be compared directly. Also, even if a limit is set on the pod, the total memory and CPU seen by top inside the pod are still the machine totals, not the amounts allocatable to the pod:
    • A process's RSS is all the physical memory it uses (file_rss + anon_rss), i.e. anonymous pages + mapped pages (including shared memory);
    • The cgroup RSS is anonymous and swap cache memory only and does not include shared memory. Neither includes the file cache.

⑤ Why do kubectl top pod and docker stats return different values?

  • docker stats [containerId] shows the container's current usage:

(image: docker stats output)

  • Even for a pod with a single container, the value from docker stats does not equal the value from kubectl top, matching neither container_memory_usage_bytes nor container_memory_working_set_bytes, because docker stats and cAdvisor calculate differently; the docker stats value is generally smaller than kubectl top. Its calculation logic is:
docker stats = container_memory_usage_bytes - container_memory_cache
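  • A rough way to check this relationship on a node (cgroup v1 with the docker cgroupfs layout assumed; depending on the docker version the cache subtraction may differ slightly, so the match is only approximate):

CID=$(docker ps -q --no-trunc | head -n 1)
docker stats --no-stream $CID
USAGE=$(cat /sys/fs/cgroup/memory/docker/$CID/memory.usage_in_bytes)
CACHE=$(awk '/^total_cache/ {print $2}' /sys/fs/cgroup/memory/docker/$CID/memory.stat)
echo "usage - cache = $(( USAGE - CACHE )) bytes"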

4. Summary

  • In general, you do not need to constantly watch the usage of nodes or pods, because cluster-autoscaler and the Horizontal Pod Autoscaler (HPA) handle these two kinds of resource changes; resource metrics are more usefully persisted from cAdvisor into Prometheus for backtracking history and sending alerts.
  • Other notes:
    • Although kubectl top --help mentions Storage support, it is still not supported as of version 1.16;
    • Before 1.13 heapster is required, and after 1.13 metrics-server is required; this part of the kubectl top --help output is wrong, as it only mentions heapster;
    • The monitoring graphs in the k8s dashboard use heapster by default; after switching to metrics-server the data becomes abnormal, and an extra metrics-server-scraper pod needs to be deployed for interface conversion. See the dashboard documentation for details.


Origin blog.csdn.net/Forever_wj/article/details/131282501