Intermittent readiness probe errors on a Kubernetes 1.24 cluster that uses Docker as the container runtime


1. Environment introduction

Component     Version
kubernetes    1.24.2
docker        18.03.1-ce
cri-docker    0.2.6

2. Error information

  Recently, the calico-node Pod on the Kubernetes cluster was found to hang after running for about 2 days. Restarting the cloud host node where calico-node was running restored the service, but the Pod hung again 2 days later. The events of calico-node show the following error message:

(combined from similar events): Readiness probe errored: rpc error: code = Unknown 
desc = failed to exec in container: failed to create exec "d926d9226559a6673c1dbb904262c...398387ad3b04420": 
cannot exec in a stopped state: unknown

  Kubernetes reports that the readiness probe of calico-node could not be executed.

3. Analyze the problem

3.1 kubernetes health check

  Kubernetes has three common health check probes, namely:

  • Liveness: liveness probe
  • Readiness: readiness probe
  • Startup: startup probe, introduced as an alpha feature in version 1.16 and enabled by default (beta) since version 1.18

3.1.1 Liveness Probe

  The kubelet uses liveness probes to determine when to restart containers. For example, a liveness probe can detect an application deadlock (where the application is running but cannot proceed further). Restarting containers in this state can help improve the availability of applications, even if there are bugs in them.

3.1.2 Readiness Probe

  The kubelet uses the readiness probe to know when the container is ready to accept request traffic. When all the containers in a Pod are ready, the Pod can be considered ready. One use of this signal is to control which Pod is used as the backend of the Service. If the Pod is not ready, it will be removed from the Service's load balancer.

3.1.3 Startup Probe

  The kubelet uses startup probes to know when an application container has started. If a startup probe is configured, the liveness and readiness checks are held back until it succeeds, so they cannot interfere with application startup. This makes startup probes useful for slow-starting containers, which might otherwise be killed by the liveness probe before they are up and running.
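  The way the three probes interact can be illustrated with a small simulation. This is only a sketch: the ContainerProbeState class and its method names are hypothetical and do not mirror the kubelet's real implementation, and failure thresholds are left out, so a single liveness failure triggers a restart here. It models one rule, namely that liveness and readiness results are ignored until the startup probe has succeeded. The startup_budget_seconds helper reflects the usual rule of thumb that a startup probe allows roughly failureThreshold × periodSeconds for the application to start.

```python
from dataclasses import dataclass


@dataclass
class ContainerProbeState:
    """Toy model of one container's probe state (hypothetical, not kubelet code)."""
    has_startup_probe: bool
    started: bool = False   # becomes True once the startup probe succeeds
    ready: bool = False     # result of the last evaluated readiness probe
    restarts: int = 0       # restarts triggered by liveness failures

    def __post_init__(self):
        # Without a startup probe the container counts as started immediately.
        if not self.has_startup_probe:
            self.started = True

    def on_startup_result(self, ok: bool) -> None:
        if ok:
            self.started = True

    def on_readiness_result(self, ok: bool) -> None:
        # Readiness is only evaluated after startup has succeeded.
        if self.started:
            self.ready = ok

    def on_liveness_result(self, ok: bool) -> None:
        # A liveness failure restarts the container, but only after startup.
        # Simplification: real probes need several consecutive failures.
        if self.started and not ok:
            self.restarts += 1
            self.ready = False
            self.started = not self.has_startup_probe


def startup_budget_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Approximate longest time a startup probe lets a container take to start."""
    return failure_threshold * period_seconds
```

  With a startup probe configured, a failing liveness result that arrives before startup has succeeded changes nothing in this model, which is exactly the protection that slow-starting containers need.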

3.2 Detection method

  • httpGet: send an HTTP GET request to the service in the container; a 2xx or 3xx status code counts as healthy
  • exec: execute a command inside the container; exit status 0 counts as healthy
  • tcpSocket: open a TCP connection to a port of the service in the container; a successful connection counts as healthy
  • grpc: send a gRPC health check request to the service in the container
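  To make the success criteria of these mechanisms concrete, here is a minimal Python sketch of three of them. It is an illustration only: the function names tcp_check, http_check and exec_check are hypothetical, and the checks run on the local machine, whereas the kubelet runs an exec probe inside the container through the container runtime. The gRPC variant is omitted because it needs a third-party client library.

```python
import http.client
import socket
import subprocess
import urllib.request


def tcp_check(host: str, port: int, timeout: float = 1.0) -> bool:
    """tcpSocket-style check: healthy if a TCP connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(url: str, timeout: float = 1.0) -> bool:
    """httpGet-style check: healthy if the status code is in 200..399."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # Connection errors, HTTP error statuses and malformed URLs all fail.
        return False


def exec_check(command: list[str], timeout: float = 1.0) -> bool:
    """exec-style check: healthy if the command exits with status 0."""
    try:
        result = subprocess.run(
            command,
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The command could not be started or did not finish in time.
        return False
    return result.returncode == 0
```

  Note that exec_check returns False when the command cannot be started at all. In that case the probe itself could not be executed, which is a different situation from the application reporting that it is unhealthy.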

  In this incident the failure occurred in the readiness probe, which uses exec to check the calico-node Pod. Because the probe could not be executed, the Pod containing the calico-node container was reported as not ready and no longer received traffic through Kubernetes Services. As a result calico-node stayed in the Running state with 0 Ready instances, and the service was unavailable.

  After consulting the relevant documentation, our guess is that the problem lies in the container runtime. The kubelet talks to container runtimes through the CRI (Container Runtime Interface), which Docker itself does not implement. To stay compatible with Docker, Kubernetes originally shipped an adapter called dockershim. Dockershim was deprecated in Kubernetes 1.20 and its code was removed in 1.24, so to keep using Docker as the container runtime in Kubernetes 1.24 and later you need to install the cri-docker service (the cri-dockerd project) separately. The cri-docker version used here may not be very stable: after the service has run for a few days it gets into an abnormal state in which exec calls into the container fail, so the readiness probe that the kubelet runs against the Pod errors out.
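  The version dependency can be written down as a small check. This sketch assumes the cutoff recorded in the upstream release notes, which list dockershim as removed in Kubernetes 1.24; the function names parse_major_minor and needs_cri_adapter are hypothetical helpers for illustration.

```python
def parse_major_minor(version: str) -> tuple[int, int]:
    """Parse a version such as 'v1.24.2' or '1.24.2' into (major, minor)."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def needs_cri_adapter(kubernetes_version: str, runtime: str) -> bool:
    """Return True if Docker needs an external CRI adapter such as cri-dockerd.

    Assumption: the built-in dockershim is absent from Kubernetes 1.24 onwards.
    """
    return runtime == "docker" and parse_major_minor(kubernetes_version) >= (1, 24)
```

  For the environment in this article, needs_cri_adapter("1.24.2", "docker") is True, which is why cri-docker sits between the kubelet and Docker on every exec probe.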

4. Solutions

  If you want to keep using Docker in clusters running Kubernetes 1.24 or later, you may have to wait for further improvements from the open source community. We therefore recommend switching the container runtime from Docker to containerd. containerd is a mature container runtime that Docker donated to the open source community, and Docker itself is a higher-level container service built on top of containerd.
  For the steps to switch from Docker to containerd, please refer to: kubernetes upgrade container runtime from docker to containerd. After the switch and several days of continuous observation, the readiness probe error that used to appear every 2 days did not recur, which supports the preliminary diagnosis. Running the latest version of Kubernetes in a production environment carries a certain risk: upgrade with caution, and observe the new version in a test environment for an extended period before upgrading.
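  After a runtime switch it can be useful to confirm which runtime each node actually reports. The sketch below parses the output of `kubectl get nodes -o json` and reads the `status.nodeInfo.containerRuntimeVersion` field, whose value has the form "<runtime>://<version>". The function names are hypothetical, and the kubectl output is expected to be captured separately and passed in as a string.

```python
import json


def runtimes_by_node(nodes_json: str) -> dict[str, str]:
    """Map node name -> runtime name from `kubectl get nodes -o json` output."""
    result = {}
    for item in json.loads(nodes_json)["items"]:
        version = item["status"]["nodeInfo"]["containerRuntimeVersion"]
        # The field looks like "containerd://<version>" or "docker://<version>".
        result[item["metadata"]["name"]] = version.split("://", 1)[0]
    return result


def nodes_still_on_docker(nodes_json: str) -> list[str]:
    """Return the sorted names of nodes that still report Docker as runtime."""
    runtimes = runtimes_by_node(nodes_json)
    return sorted(name for name, runtime in runtimes.items() if runtime == "docker")
```

  An empty list from nodes_still_on_docker means every node reports a runtime other than Docker, so exec probes no longer pass through cri-docker.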

Origin blog.csdn.net/hzwy23/article/details/129270220