Fault analysis | Kubernetes fault diagnosis process

1. Overview and key terms

1.1 Overview

This article is organized around three modules: Pod, Service, and Ingress. For common day-to-day Kubernetes failures, it gives concrete troubleshooting steps and attaches relevant solutions and references.

1.2 Key terms

  1. Pod: The smallest deployable computing unit that can be created and managed in Kubernetes. A Pod is a group of one or more containers that share storage and networking, together with a specification of how to run them.

  2. Port-forward: Maps a local port to a specified application port via port forwarding.

  3. Service: A Kubernetes Service is an abstraction that defines a logical set of Pods and a policy for accessing them, sometimes called a microservice.

  4. Ingress: Routes HTTP and HTTPS traffic from outside the cluster to Services inside it. Traffic routing is controlled by rules defined on the Ingress resource.

2. Fault diagnosis process

2.1 Pods module check

  • Work through the following checks in order: if a check passes, continue to the next one; if it fails, jump to the step indicated.

2.1.1 Check if any pod is in PENDING state

  1. kubectl get pods: If any pod is in the PENDING state, continue below; otherwise go to 2.1.5.

[root@10-186-65-37 ~]# kubectl get pods
NAME                            READY   STATUS    RESTARTS   AGE
myapp-deploy-55b54d55b8-5msx8   0/1     Pending   0          5m
 

  2. kubectl describe pod <pod-name>: Inspect the detailed output (see the example below). If the events show that the cluster does not have enough resources to schedule the pod, expand the cluster; otherwise go to 2.1.2.
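For illustration, a pod stuck in Pending due to insufficient resources usually carries a FailedScheduling event; the pod name and message below are hypothetical, and the exact wording varies by scheduler version:

[root@10-186-65-37 ~]# kubectl describe pod myapp-deploy-55b54d55b8-5msx8
...
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  2m    default-scheduler  0/3 nodes are available: 3 Insufficient cpu.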

2.1.2 Check whether the ResourceQuota limit is triggered

  1. kubectl describe resourcequota -n <namespace>: Check whether any quota in the namespace is exhausted (see the example below).

  2. If a quota limit has been hit, release or expand the corresponding resources; refer to: https://kubernetes.io/zh/docs/concepts/configuration/manage-resources-containers/#extended-resources

Otherwise go to 2.1.3.
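A minimal sketch of what an exhausted quota looks like; the quota name and limits are hypothetical. When Used equals Hard for a resource, new pods requesting that resource stay Pending:

[root@10-186-65-37 ~]# kubectl describe resourcequota compute-quota -n default
Name:          compute-quota
Namespace:     default
Resource       Used  Hard
--------       ----  ----
limits.cpu     4     4
limits.memory  6Gi   16Gi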

2.1.3 Check if there is a PVC in PENDING state

  1. A PersistentVolume (PV) is a piece of storage in the cluster, provisioned in advance by an administrator or dynamically provisioned using a StorageClass; a PersistentVolumeClaim (PVC) expresses a user's request for storage.

  2. kubectl describe pvc <pvc-name>: If STATUS is Pending (see the example after this list), refer to the following link to resolve it: https://kubernetes.io/zh/docs/concepts/storage/persistent-volumes/

Otherwise go to 2.1.4.
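For reference, a Pending claim is also easy to spot in the list view; the claim name and storage class here are hypothetical:

[root@10-186-65-37 ~]# kubectl get pvc
NAME       STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-pvc   Pending                                      standard       3m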

2.1.4 Check whether the pod has been assigned to a node

  1. kubectl get pods -o wide: Check the NODE column (see the example below). If the pod has not been assigned to a node, it is a Scheduler problem; refer to the following link: https://kubernetes.io/zh/docs/concepts/scheduling-eviction/kube-scheduler/

Otherwise the pod has been assigned but is still Pending, which points to a kubelet problem.
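A hypothetical example of the check; a NODE value of <none> means the scheduler has not yet placed the pod:

[root@10-186-65-37 ~]# kubectl get pods -o wide
NAME                            READY   STATUS    RESTARTS   AGE   IP       NODE
myapp-deploy-55b54d55b8-5msx8   0/1     Pending   0          5m    <none>   <none>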

2.1.5 Check if pods are in RUNNING state

  1. kubectl get pods -o wide: If the pods are in the RUNNING state, go to 2.1.10; otherwise go to 2.1.6.

2.1.6 Check pod logs

  1. kubectl logs <pod-name>: If the logs can be retrieved, fix the problem according to what they show.

  2. If the logs cannot be retrieved because the container exits too quickly, fetch the previous container's logs: kubectl logs <pod-name> --previous

  3. If the logs cannot be retrieved and the container is not exiting quickly, go to 2.1.7.

2.1.7 Check whether the Pod status is ImagePullBackOff

  1. kubectl describe pod <pod-name>: Check whether the status is ImagePullBackOff. If not, go to 2.1.8.

  2. Check that the image name is correct, and fix it if it is wrong.

  3. Check that the image tag exists and is valid.

  4. Is the image being pulled from a private registry? If so, confirm that the pull credentials are configured correctly (see the sketch below).

  5. If the image is not being pulled from a private registry, the problem may lie with the CRI (Container Runtime Interface) or the kubelet.
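A minimal sketch of configuring credentials for a private registry; the registry address, secret name, and image are all hypothetical:

[root@10-186-65-37 ~]# kubectl create secret docker-registry my-registry-key \
    --docker-server=registry.example.com \
    --docker-username=<user> \
    --docker-password=<password>

Then reference the secret from the pod spec:

spec:
  imagePullSecrets:
    - name: my-registry-key
  containers:
    - name: myapp
      image: registry.example.com/myapp:1.0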

2.1.8 Check whether the Pod status is CrashLoopBackOff

  1. kubectl describe pod <pod-name>: Check whether the status is CrashLoopBackOff. If not, go to 2.1.9.

  2. If so, inspect the logs and fix the application crash.

  3. Is the CMD instruction missing from the Dockerfile? Check with: docker history <image-id> (add --no-trunc to show the full output)

  4. Does the pod restart repeatedly, cycling between Running and CrashLoopBackOff? If so, fix the liveness probe (see the sketch below); refer to the following link: https://kubernetes.io/zh/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
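A minimal liveness probe sketch, assuming an HTTP health endpoint; the path, port, and timings are hypothetical and must match what the application actually serves. A probe pointing at the wrong path, or firing before the app is up, will kill healthy containers:

livenessProbe:
  httpGet:
    path: /healthz      # hypothetical health endpoint
    port: 8080          # must be a port the container listens on
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3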

2.1.9 Check whether the Pod status is RunContainerError

  1. kubectl describe pod <pod-name>: Check whether the status is RunContainerError.

  2. If the status is RunContainerError, the problem is likely caused by mounting a volume; refer to the following link: https://kubernetes.io/zh/docs/concepts/storage/volumes/

  3. Otherwise, seek help on sites such as StackOverflow.

2.1.10 Check if pods are in READY state

  1. If a pod is in the READY state, continue with the port mapping below; if no pod is in the READY state, go to 2.1.11.

  2. kubectl port-forward <pod-name> 8080:<pod-port>

     a) Set up the mapping.

     b) Verify that the mapping works, for example with curl localhost:8080.

  3. If the mapping succeeds, go to 2.2.

If it fails, confirm that the application listens on all addresses (0.0.0.0) rather than only on localhost; a way to check is sketched below.

If the application cannot listen on all addresses, the pod is in an unknown state (Unknown state).
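One way to check the listen address from inside the container, assuming the image contains netstat (busybox and many base images do; the pod name and port are hypothetical):

[root@10-186-65-37 ~]# kubectl exec <pod-name> -- netstat -tln
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 127.0.0.1:8080          0.0.0.0:*               LISTEN

A local address of 127.0.0.1:8080 means the app is only reachable from inside the pod; it should be 0.0.0.0:8080.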

2.1.11 Checking Readiness (Readiness Probe)

  1. kubectl describe pod <pod-name>

  2. If the output shows a failing readiness probe (see the example below), fix the corresponding problem; refer to the following link: https://kubernetes.io/zh/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/

  3. Otherwise, the pod is in an unknown state (Unknown state).
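A failing readiness probe shows up in the Events section of kubectl describe pod; this sample event is hypothetical, and the status code depends on the application:

Events:
  Type     Reason     Age                From     Message
  ----     ------     ----               ----     -------
  Warning  Unhealthy  30s (x5 over 90s)  kubelet  Readiness probe failed: HTTP probe failed with statuscode: 500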

2.2 Service module check

2.2.1 Service Current Status Check

  1. kubectl describe service <service-name>: Inspect the output (a sample is sketched below).

  2. Is the Endpoints field populated with pod addresses? If the output is abnormal, go to 2.2.2.

  3. kubectl port-forward service/<service-name> 8080:<service-port>: Map the service to a local port and test it, for example with curl localhost:8080.

  4. If the test succeeds, go to 2.3; if it fails, go to 2.2.4.
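A sketch of healthy output; the service name, selector, and addresses are hypothetical. The key field is Endpoints, which should list the IP:port of each matching Ready pod:

[root@10-186-65-37 ~]# kubectl describe service myapp-svc
Name:              myapp-svc
Namespace:         default
Selector:          app=myapp
Type:              ClusterIP
IP:                10.96.23.12
Port:              <unset>  80/TCP
TargetPort:        8080/TCP
Endpoints:         10.244.1.12:8080,10.244.2.9:8080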

2.2.2 Selector and Pod label comparison

  1. View the pod's labels: kubectl describe pod <pod-name>

  2. View the service's selector: kubectl describe service <service-name>

  3. Compare the two (see the quick check below): if they do not match, correct them; if they match, go to 2.2.3.
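A quick way to verify the match, assuming the service's selector is app=myapp (hypothetical): listing pods by that selector should return exactly the pods the service is meant to target, and an empty result means the selector matches nothing:

[root@10-186-65-37 ~]# kubectl get pods -l app=myapp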

2.2.3 Check whether the Pod has been assigned an IP

  1. View the pod's IP: kubectl describe pod <pod-name>

  2. If an IP has been assigned correctly, the problem is due to the kubelet.

  3. If no IP has been assigned, the problem is caused by the Controller Manager.
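The IP can also be read directly with a JSONPath query; an empty result means no IP has been assigned yet:

[root@10-186-65-37 ~]# kubectl get pod <pod-name> -o jsonpath='{.status.podIP}'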

2.2.4 Check Service TargetPort and Pod ContainerPort

  1. View the service's TargetPort: kubectl describe service <service-name>

  2. View the pod's ContainerPort: kubectl describe pod <pod-name>

  3. If the two are consistent, the problem is caused by kube-proxy; if they are inconsistent, correct them (see the sketch below).
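A minimal sketch of the two fields that must agree; all names and ports are hypothetical. The service's targetPort must equal the port the container actually listens on (usually declared as containerPort):

apiVersion: v1
kind: Service
metadata:
  name: myapp-svc
spec:
  selector:
    app: myapp
  ports:
    - port: 80          # port the Service exposes
      targetPort: 8080  # must match the container's listening port
---
apiVersion: v1
kind: Pod
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  containers:
    - name: myapp
      image: myapp:1.0
      ports:
        - containerPort: 8080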

2.3 Ingress module check

2.3.1 Ingress Current Status Check

  1. kubectl describe ingress <ingress-name>: Inspect the output (a sample is sketched below).

  2. Does the Backends column show valid backend addresses? If so, go to 2.3.4; otherwise go to 2.3.2.
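A sketch of healthy output; the host, service, and addresses are hypothetical. Each backend should resolve to pod endpoints; an error such as <error: endpoints "myapp-svc" not found> indicates a broken backend:

[root@10-186-65-37 ~]# kubectl describe ingress myapp-ingress
Name:             myapp-ingress
Namespace:        default
Rules:
  Host               Path  Backends
  ----               ----  --------
  myapp.example.com
                     /     myapp-svc:80 (10.244.1.12:8080,10.244.2.9:8080)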

2.3.2 Check ServiceName and ServicePort

  1. kubectl describe ingress <ingress-name>

  2. kubectl describe service <service-name>

 

  3. Check that the serviceName and servicePort in the Ingress match the name and port of the actual Service (see the sketch below). If they already match, go to 2.3.3; otherwise correct the error.
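A minimal sketch of where these fields live (networking.k8s.io/v1 schema; the names, host, and port are hypothetical):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress
spec:
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp-svc   # must match the Service's metadata.name
                port:
                  number: 80      # must match one of the Service's ports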

2.3.3 Ingress controller documentation

  1. The problem is caused by the Ingress controller; consult its documentation to find a solution: https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/

2.3.4 Check port-forward ingress

  1. kubectl port-forward <ingress-pod-name> 8080:<ingress-port>, then test access: curl localhost:8080

If access is normal, go to 2.3.5; otherwise go to 2.3.3.
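For example, with the NGINX ingress controller (the namespace and pod name are hypothetical), where the Host header selects which Ingress rule is exercised:

[root@10-186-65-37 ~]# kubectl port-forward -n ingress-nginx ingress-nginx-controller-6c9f8d5b4-abcde 8080:80
[root@10-186-65-37 ~]# curl -H 'Host: myapp.example.com' localhost:8080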

2.3.5 Check whether it can be accessed through Ingress on the external network

  1. If the application can be accessed from the external network, troubleshooting is complete.

  2. If it cannot be accessed from the external network, the problem lies in the infrastructure or in how the cluster is exposed to the outside; troubleshoot those layers.

