eBook Download | Super Practical! A Collection of Kubernetes Troubleshooting Cases from Alibaba After-Sales Experts

Book.png

<Follow the public account and reply "Troubleshoot" to get the download link>

"Introduction to Kubernetes" open download

The author, Luo Jianlong (nickname Shengdong), is an Alibaba Cloud technical expert with many years of experience in operating system and graphics driver debugging and development. He currently focuses on the cloud-native area, container clusters, and service mesh. The book is divided into a theory part and a practice part, twelve technical articles in total, which analyze in depth topics such as cluster control, cluster scaling principles, and image pulling, helping you become comfortable with Kubernetes!

There are four highlights in this book:

  • Distilled from real production cases
  • A close fit between theory and practice
  • Accessible theoretical explanations
  • Thorough investigation of technical details

It helps you grasp 6 core principles in one go, thoroughly understand the underlying theory, and master the elegant handling of 6 typical problems!

Directory.png
(Table of contents)

How to download for free?

Follow "Alibaba Cloud Native" and reply ** Troubleshooting ** to download this book for free.

Foreword

The following content is excerpted from the book "In-depth Kubernetes".

Alibaba Cloud has its own Kubernetes container cluster product. As the number of Kubernetes clusters shipped increased sharply, online users occasionally found that, with very low probability, cluster nodes would become NotReady.

According to our observations, one or two customers ran into this problem almost every month. Once a node is NotReady, the cluster master loses all control over it: it can no longer schedule new Pods to the node, nor retrieve real-time information about the Pods already running on it.

In that earlier problem, our troubleshooting path went from the K8s cluster, to the container runtime, and on to sd-bus and systemd; it was anything but simple. That problem has since been fixed in systemd, so the chance of running into it is getting lower and lower.

However, cluster node readiness problems still occur today, only for different reasons.

Today's article focuses on another example of a cluster node going NotReady, a problem completely different from the one above.

Symptoms

The symptom of this problem is that cluster nodes enter the NotReady state. Restarting the node resolves it temporarily, but after roughly 20 days the problem reappears.

1.png

After the problem occurs, if we restart kubelet on the node, the node becomes Ready, but that state lasts only about three minutes. This behavior is a distinctive feature of this case.

The big picture

Before analyzing this issue in detail, let's first look at the big picture behind the ready state of cluster nodes. In a K8s cluster, four main components are involved in node readiness: etcd, the cluster's core database; the API Server, the cluster's entry point; the node controller; and kubelet, which resides directly on the cluster node and manages it.

2.png

On one hand, kubelet plays the role of a cluster controller: it periodically obtains Pods and other related resources from the API Server and, based on this information, controls the execution of the Pods running on the node. On the other hand, kubelet acts as a monitor of node health: it collects node information and, in its role as a cluster client, synchronizes these conditions to the API Server.

In this problem, kubelet plays the second role.

Kubelet uses the NodeStatus mechanism shown in the figure above to regularly check the status of the cluster node and synchronize that status to the API Server. The main basis NodeStatus uses to judge whether a node is ready is PLEG.

PLEG is short for Pod Lifecycle Event Generator. Essentially, it periodically checks the Pods running on the node; whenever it finds a change of interest, PLEG wraps that change into an Event and sends it to syncLoop, kubelet's main synchronization mechanism, for handling. When PLEG's Pod checks cannot be executed on schedule, however, the NodeStatus mechanism considers the node unhealthy and synchronizes that status to the API Server.
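To make the health-check side of this concrete, below is a minimal Go sketch of the idea, with simplified names rather than kubelet's actual code: relist records when it last completed, and a separate health check compares that timestamp against a three-minute threshold.

package main
import (
	"fmt"
	"sync/atomic"
	"time"
)
type pleg struct {
	relistTime atomic.Value  // time.Time of the last completed relist
	threshold  time.Duration // three minutes in kubelet's case
}
func (p *pleg) relist() {
	// ... inspect pod/container states and emit events to syncLoop ...
	p.relistTime.Store(time.Now())
}
func (p *pleg) healthy() (bool, error) {
	last, ok := p.relistTime.Load().(time.Time)
	if !ok {
		return true, nil // no relist has completed yet
	}
	if elapsed := time.Since(last); elapsed > p.threshold {
		return false, fmt.Errorf("pleg was last seen active %v ago; threshold is %v", elapsed, p.threshold)
	}
	return true, nil
}
func main() {
	p := &pleg{threshold: 3 * time.Minute}
	p.relist()
	fmt.Println(p.healthy())
}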

It is the node controller that ultimately turns the node status reported by kubelet into the node's final status. I deliberately distinguish here between the status reported by kubelet and the node's final status, because the former is actually the Conditions we see when describing the node, while the latter is the NotReady state shown in the actual node list.
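As a rough illustration (the exact columns and wording vary by Kubernetes version), the Ready condition reported by kubelet on a node affected by this kind of problem might look something like this in kubectl describe node output:

Conditions:
  Type    Status  Reason           Message
  ----    ------  ------           -------
  Ready   False   KubeletNotReady  PLEG is not healthy: pleg was last seen active 3m5s ago; threshold is 3m0s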

3.png

Ready for three minutes

After the problem occurs, if we restart kubelet, the node becomes NotReady again only after three minutes. This phenomenon is a key entry point into the problem.

4.png

Before explaining it, take a look at the official PLEG diagram above, which mainly shows two flows.

  • On the one hand, kubelet acts as a cluster controller: it obtains pod spec changes from the API Server and then creates or terminates pods by spawning worker threads;
  • On the other hand, PLEG periodically checks container states and feeds the results back to kubelet in the form of events.

Here, PLEG has two key time parameters: the check interval and the check timeout. By default, PLEG checks are one second apart; in other words, after each check finishes, PLEG waits one second before performing the next one. The timeout for each check is three minutes: if a PLEG check cannot complete within three minutes, the NodeStatus mechanism described in the previous section takes this as evidence that the node is NotReady and synchronizes that state to the API Server.
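As a hedged sketch of that cadence, not kubelet's actual code but using the same wait.Until / JitterUntil helpers that appear in the kubelet call stack later in this article, the one-second relist loop can be expressed like this:

package main
import (
	"fmt"
	"time"
	"k8s.io/apimachinery/pkg/util/wait"
)
func main() {
	stopCh := make(chan struct{})
	relist := func() {
		// In kubelet this would query the container runtime for pod states
		// and emit lifecycle events; here we only log a timestamp.
		fmt.Println("relist at", time.Now().Format(time.RFC3339))
	}
	// Run relist, wait one second after it returns, then run it again,
	// until stopCh is closed.
	go wait.Until(relist, time.Second, stopCh)
	time.Sleep(5 * time.Second)
	close(stopCh)
}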

The reason the node we observed stayed ready for three minutes after kubelet was restarted is that, after the restart, the first PLEG check never completed successfully. The node remained in the ready state and was only reported to the cluster as not ready once the three-minute timeout expired.

As shown in the figure below, the upper timeline shows PLEG's execution flow under normal circumstances, and the lower one shows the problematic case; relist is the main function of the check.

5.png

PLEG

With the principle understood, let's look at the PLEG-related logs. Each log line can be split into two parts: "skipping pod synchronization" is output by kubelet's synchronization function syncLoop and indicates that it skipped one pod synchronization; the rest, "PLEG is not healthy: pleg was last seen active … ago; threshold is 3m0s", clearly shows the three-minute relist timeout described in the previous section.

17:08:22.299597 kubelet skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.000091019s ago; threshold is 3m0s]
17:08:22.399758 kubelet skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.100259802s ago; threshold is 3m0s]
17:08:22.599931 kubelet skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.300436887s ago; threshold is 3m0s]
17:08:23.000087 kubelet skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.700575691s ago; threshold is 3m0s]
17:08:23.800258 kubelet skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m1.500754856s ago; threshold is 3m0s]
17:08:25.400439 kubelet skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m3.100936232s ago; threshold is 3m0s]
17:08:28.600599 kubelet skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m6.301098811s ago; threshold is 3m0s]
17:08:33.600812 kubelet skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m11.30128783s ago; threshold is 3m0s]
17:08:38.600983 kubelet skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m16.301473637s ago; threshold is 3m0s]
17:08:43.601157 kubelet skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m21.301651575s ago; threshold is 3m0s]
17:08:48.601331 kubelet skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m26.301826001s ago; threshold is 3m0s]

What lets us directly see how the relist function is executing is the kubelet call stack. If we send a SIGABRT signal to the kubelet process, the golang runtime dumps all of the process's call stacks for us. Note that this operation kills the kubelet process; but because restarting kubelet does not destroy the reproduction environment for this problem, the impact is negligible.
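For example (assuming pidof is available and kubelet runs as a single process), sending the signal can be as simple as the command below; on most installations the service manager, typically systemd, restarts kubelet afterwards.

kill -ABRT $(pidof kubelet)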

The following is the call stack of PLEG's relist function. Reading from bottom to top, we can see that relist is waiting on a gRPC call to PodSandboxStatus.

kubelet: k8s.io/kubernetes/vendor/google.golang.org/grpc/transport.(*Stream).Header()
kubelet: k8s.io/kubernetes/vendor/google.golang.org/grpc.recvResponse()
kubelet: k8s.io/kubernetes/vendor/google.golang.org/grpc.invoke()
kubelet: k8s.io/kubernetes/vendor/google.golang.org/grpc.Invoke()
kubelet: k8s.io/kubernetes/pkg/kubelet/apis/cri/runtime/v1alpha2.(*runtimeServiceClient).PodSandboxStatus()
kubelet: k8s.io/kubernetes/pkg/kubelet/remote.(*RemoteRuntimeService).PodSandboxStatus()
kubelet: k8s.io/kubernetes/pkg/kubelet/kuberuntime.instrumentedRuntimeService.PodSandboxStatus()
kubelet: k8s.io/kubernetes/pkg/kubelet/kuberuntime.(*kubeGenericRuntimeManager).GetPodStatus()
kubelet: k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).updateCache()
kubelet: k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).relist()
kubelet: k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).(k8s.io/kubernetes/pkg/kubelet/pleg.relist)-fm()
kubelet: k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc420309260)
kubelet: k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil()
kubelet: k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Until()

Searching the kubelet call stacks for PodSandboxStatus, it is easy to find the thread below, which is the one actually querying the sandbox status. Reading from the bottom up, we find that this thread is trying to acquire a Mutex in the Plugin Manager.

kubelet: sync.runtime_SemacquireMutex()
kubelet: sync.(*Mutex).Lock()
kubelet: k8s.io/kubernetes/pkg/kubelet/dockershim/network.(*PluginManager).GetPodNetworkStatus()
kubelet: k8s.io/kubernetes/pkg/kubelet/dockershim.(*dockerService).getIPFromPlugin()
kubelet: k8s.io/kubernetes/pkg/kubelet/dockershim.(*dockerService).getIP()
kubelet: k8s.io/kubernetes/pkg/kubelet/dockershim.(*dockerService).PodSandboxStatus()
kubelet: k8s.io/kubernetes/pkg/kubelet/apis/cri/runtime/v1alpha2._RuntimeService_PodSandboxStatus_Handler()
kubelet: k8s.io/kubernetes/vendor/google.golang.org/grpc.(*Server).processUnaryRPC()
kubelet: k8s.io/kubernetes/vendor/google.golang.org/grpc.(*Server).handleStream()
kubelet: k8s.io/kubernetes/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1()
kubelet: created by k8s.io/kubernetes/vendor/google.golang.org/grpc.(*Server).serveStreams.func1

This Mutex is only used inside the Plugin Manager, so we examine all the call stacks related to the Plugin Manager. Some of these threads are waiting for the Mutex, while the rest are waiting for the Terway CNI plugin.

kubelet: syscall.Syscall6()
kubelet: os.(*Process).blockUntilWaitable()
kubelet: os.(*Process).wait()
kubelet: os.(*Process).Wait()
kubelet: os/exec.(*Cmd).Wait()
kubelet: os/exec.(*Cmd).Run()
kubelet: k8s.io/kubernetes/vendor/github.com/containernetworking/cni/pkg/invoke.(*RawExec).ExecPlugin()
kubelet: k8s.io/kubernetes/vendor/github.com/containernetworking/cni/pkg/invoke.(*PluginExec).WithResult()
kubelet: k8s.io/kubernetes/vendor/github.com/containernetworking/cni/pkg/invoke.ExecPluginWithResult()
kubelet: k8s.io/kubernetes/vendor/github.com/containernetworking/cni/libcni.(*CNIConfig).AddNetworkList()
kubelet: k8s.io/kubernetes/pkg/kubelet/dockershim/network/cni.(*cniNetworkPlugin).addToNetwork()
kubelet: k8s.io/kubernetes/pkg/kubelet/dockershim/network/cni.(*cniNetworkPlugin).SetUpPod()
kubelet: k8s.io/kubernetes/pkg/kubelet/dockershim/network.(*PluginManager).SetUpPod()
kubelet: k8s.io/kubernetes/pkg/kubelet/dockershim.(*dockerService).RunPodSandbox()
kubelet: k8s.io/kubernetes/pkg/kubelet/apis/cri/runtime/v1alpha2._RuntimeService_RunPodSandbox_Handler()
kubelet: k8s.io/kubernetes/vendor/google.golang.org/grpc.(*Server).processUnaryRPC()
kubelet: k8s.io/kubernetes/vendor/google.golang.org/grpc.(*Server).handleStream()
kubelet: k8s.io/kubernetes/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1()
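Putting the two stacks together: the RunPodSandbox handler above holds the lock while exec'ing the CNI plugin, and the PodSandboxStatus handler shown earlier waits for that lock. Below is a minimal, self-contained Go sketch of this effect; the real plugin manager's locking is more fine-grained than a single Mutex, so treat this only as an illustration. In the sketch the CNI-like operation releases the lock after a few seconds; in the real incident it never returned, so the status query blocked indefinitely.

package main
import (
	"fmt"
	"sync"
	"time"
)
type pluginManager struct{ mu sync.Mutex }
func (pm *pluginManager) SetUpPod() {
	pm.mu.Lock()
	defer pm.mu.Unlock()
	// Stands in for exec'ing the CNI plugin; in the real incident this
	// never returned.
	time.Sleep(3 * time.Second)
}
func (pm *pluginManager) GetPodNetworkStatus() string {
	pm.mu.Lock() // blocks for as long as SetUpPod holds the lock
	defer pm.mu.Unlock()
	return "ok"
}
func main() {
	pm := &pluginManager{}
	go pm.SetUpPod()
	time.Sleep(100 * time.Millisecond) // let SetUpPod take the lock first
	start := time.Now()
	fmt.Println("status:", pm.GetPodNetworkStatus(), "after", time.Since(start))
}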

Unresponsive Terwayd

Before explaining this issue further, we need to distinguish between Terway and Terwayd. Essentially, Terway and Terwayd have a client-server relationship, the same as the relationship between flannel and flanneld. Terway is a plugin that implements the CNI interface that kubelet expects.
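Because this relationship matters for the rest of the analysis, here is a minimal, self-contained Go sketch of such a client/server split over a unix socket, using the standard net/rpc package that also shows up in the terway call stack below. All names here (NetworkService, AddArgs, the socket path) are invented for illustration; they are not Terway's real API.

package main
import (
	"fmt"
	"net"
	"net/rpc"
	"os"
)
type AddArgs struct{ PodName, Netns string }
type AddReply struct{ IP string }
type NetworkService struct{}
// Add stands in for the daemon-side work (allocating an IP, wiring the
// veth pair via netlink, and so on).
func (s *NetworkService) Add(args AddArgs, reply *AddReply) error {
	reply.IP = "192.0.2.10" // placeholder address
	return nil
}
func main() {
	const sock = "/tmp/terwayd-demo.sock" // hypothetical socket path
	os.Remove(sock)
	// "terwayd" side: serve RPCs on a unix socket.
	srv := rpc.NewServer()
	if err := srv.Register(&NetworkService{}); err != nil {
		panic(err)
	}
	l, err := net.Listen("unix", sock)
	if err != nil {
		panic(err)
	}
	go srv.Accept(l)
	// "terway" (CNI plugin) side: dial the daemon and make a blocking call.
	client, err := rpc.Dial("unix", sock)
	if err != nil {
		panic(err)
	}
	var reply AddReply
	if err := client.Call("NetworkService.Add", AddArgs{PodName: "demo", Netns: "/var/run/netns/demo"}, &reply); err != nil {
		panic(err)
	}
	fmt.Println("pod IP:", reply.IP)
}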

6.png

The problem we saw at the end of the previous section was that, when kubelet called the Terway CNI plugin to configure a pod's network, Terway did not respond for a long time. Under normal circumstances this operation completes within seconds. When the problem occurred, Terway failed to complete the task, so we saw a large number of terway processes piling up on the cluster nodes.

7.png

Similarly, we can send SIGABRT to these terway plugin processes to print their call stacks. Below is one of terway's call stacks. This thread is executing the cmdDel function, whose role is to remove a pod's network configuration.

kubelet: net/rpc.(*Client).Call()
kubelet: main.rpcCall()
kubelet: main.cmdDel()
kubelet: github.com/AliyunContainerService/terway/vendor/github.com/containernetworking/cni/pkg/skel.(*dispatcher).checkVersionAndCall()
kubelet: github.com/AliyunContainerService/terway/vendor/github.com/containernetworking/cni/pkg/skel.(*dispatcher).pluginMain()
kubelet: github.com/AliyunContainerService/terway/vendor/github.com/containernetworking/cni/pkg/skel.PluginMainWithError()
kubelet: github.com/AliyunContainerService/terway/vendor/github.com/containernetworking/cni/pkg/skel.PluginMain()

The thread above actually removes the pod network by calling terwayd over RPC, so we need to dig into terwayd's call stacks to localize the problem further. As Terway's server side, Terwayd accepts remote calls from Terway and carries out the corresponding cmdAdd or cmdDel operations on its behalf, creating or removing pod network configuration.

As the screenshot above shows, there are thousands of Terway processes on the cluster node, all waiting on Terwayd; correspondingly, there are also thousands of threads inside Terwayd handling Terway's requests.

The following command outputs Terwayd's call stacks without restarting it.

curl  --unix-socket /var/run/eni/eni.socket 'http:/debug/pprof/goroutine?debug=2'
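This works because the daemon exposes Go's net/http/pprof handlers over that unix socket. Whether terwayd wires it up exactly this way is an assumption on my part, but a minimal sketch looks like this:

package main
import (
	"net"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)
func main() {
	// Socket path taken from the curl command above.
	l, err := net.Listen("unix", "/var/run/eni/eni.socket")
	if err != nil {
		panic(err)
	}
	// Serving DefaultServeMux makes /debug/pprof/goroutine?debug=2 return
	// the full goroutine stacks of the running process.
	if err := http.Serve(l, nil); err != nil {
		panic(err)
	}
}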

Because Terwayd's call stacks are very complicated and almost all of its threads are waiting on locks, directly analyzing the lock-waiting relationships is difficult. In this situation we can use the "timing method": assume that the thread that entered the waiting state earliest is, with high probability, the one holding the lock.

After analyzing the call stacks and the code, we found that the thread below, which had been waiting the longest (1595 minutes), is the one holding the lock. This lock blocks all the threads that create or destroy pod networks.

goroutine 67570 [syscall, 1595 minutes, locked to thread]:
syscall.Syscall6()
github.com/AliyunContainerService/terway/vendor/golang.org/x/sys/unix.recvfrom()
github.com/AliyunContainerService/terway/vendor/golang.org/x/sys/unix.Recvfrom()
github.com/AliyunContainerService/terway/vendor/github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive()
github.com/AliyunContainerService/terway/vendor/github.com/vishvananda/netlink/nl.(*NetlinkRequest).Execute()
github.com/AliyunContainerService/terway/vendor/github.com/vishvananda/netlink.(*Handle).LinkSetNsFd()
github.com/AliyunContainerService/terway/vendor/github.com/vishvananda/netlink.LinkSetNsFd()
github.com/AliyunContainerService/terway/daemon.SetupVethPair()
github.com/AliyunContainerService/terway/daemon.setupContainerVeth.func1()
github.com/AliyunContainerService/terway/vendor/github.com/containernetworking/plugins/pkg/ns.(*netNS).Do.func1()
github.com/AliyunContainerService/terway/vendor/github.com/containernetworking/plugins/pkg/ns.(*netNS).Do.func2()

By analyzing this thread's call stack in depth, we can determine three things.

  • First, Terwayd uses the netlink library to manage virtual network interfaces, IP addresses, and routes on the node; netlink provides functionality similar to iproute2 (a minimal usage sketch follows this list);
  • Second, netlink communicates directly with the kernel through sockets;
  • Third, the thread above is waiting on the recvfrom system call.
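As mentioned in the first point, here is a minimal Go sketch of the kind of netlink call the blocked thread was making: moving one end of a veth pair into a pod's network namespace via LinkSetNsFd. The interface name and netns path are hypothetical.

package main
import (
	"fmt"
	"os"
	"github.com/vishvananda/netlink"
)
func main() {
	// Look up the host-side veth interface by name (hypothetical name).
	link, err := netlink.LinkByName("veth-demo0")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	// Open the target network namespace (normally the pod's netns).
	netnsFile, err := os.Open("/var/run/netns/demo") // hypothetical path
	if err != nil {
		fmt.Println("open netns failed:", err)
		return
	}
	defer netnsFile.Close()
	// This sends a netlink request to the kernel and then waits for the
	// reply on a netlink socket; that wait is the recvfrom the goroutine
	// above is stuck in.
	if err := netlink.LinkSetNsFd(link, int(netnsFile.Fd())); err != nil {
		fmt.Println("move to netns failed:", err)
	}
}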

In this situation, we need to look at the thread's kernel call stack to confirm why it is waiting. Because it is relatively difficult to map a goroutine number to a system thread id, here we obtain the kernel call stack of the thread above by capturing a core dump of the whole system.
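As a hedged example (the exact paths depend on the distribution, how kernel debuginfo is installed, and how the dump was captured), the crash utility can load such a dump and print every thread's kernel stack; the stack format shown below matches crash's bt output.

crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/<timestamp>/vmcore
crash> foreach bt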

Searching the kernel call stacks for recvfrom locates the thread below. From this call stack alone, all we can really confirm is that the thread is waiting inside the recvfrom system call.

PID: 19246  TASK: ffff880951f70fd0  CPU: 16  COMMAND: "terwayd" 
#0 [ffff880826267a40] __schedule at ffffffff816a8f65 
#1 [ffff880826267aa8] schedule at ffffffff816a94e9 
#2 [ffff880826267ab8] schedule_timeout at ffffffff816a6ff9 
#3 [ffff880826267b68] __skb_wait_for_more_packets at ffffffff81578f80 
#4 [ffff880826267bd0] __skb_recv_datagram at ffffffff8157935f 
#5 [ffff880826267c38] skb_recv_datagram at ffffffff81579403 
#6 [ffff880826267c58] netlink_recvmsg at ffffffff815bb312 
#7 [ffff880826267ce8] sock_recvmsg at ffffffff8156a88f 
#8 [ffff880826267e58] SYSC_recvfrom at ffffffff8156aa08 
#9 [ffff880826267f70] sys_recvfrom at ffffffff8156b2fe
#10 [ffff880826267f80] tracesys at ffffffff816b5212 (via system_call)

Investigating further from here is rather difficult: this is clearly a kernel problem, or at least a kernel-related one. We went through the entire kernel core dump and checked all thread call stacks, but could not find any other thread that might be related to this problem.

The fix

The fix for this problem is based on one assumption: netlink is not 100% reliable. Netlink may respond very slowly, or not at all. So we add timeouts to netlink operations, ensuring that even when a particular netlink call never completes, terwayd as a whole is not blocked.
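Below is a minimal Go sketch of the idea, not Terway's actual patch: wrap a potentially hanging netlink operation so the caller gives up after a timeout instead of blocking forever. The helper goroutine may be left behind if the call truly never returns, which is the accepted trade-off compared with blocking terwayd as a whole. Another option is to set receive timeouts on the netlink socket itself; either way, the goal is that a misbehaving netlink call fails fast instead of pinning the lock that serializes all pod network operations.

package main
import (
	"errors"
	"fmt"
	"time"
)
var errNetlinkTimeout = errors.New("netlink operation timed out")
// withTimeout runs op in a helper goroutine and returns either its result
// or a timeout error, whichever comes first.
func withTimeout(op func() error, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() { done <- op() }()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return errNetlinkTimeout
	}
}
func main() {
	// Stand-in for a netlink call (e.g. LinkSetNsFd) that never returns.
	slowOp := func() error {
		time.Sleep(time.Hour)
		return nil
	}
	if err := withTimeout(slowOp, 2*time.Second); err != nil {
		fmt.Println("gave up:", err) // prints after two seconds
	}
}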

Summary

In the node-readiness scenario, kubelet effectively implements the node's heartbeat mechanism: it periodically reports the node's various conditions, including memory, PID, and disk pressure, and of course the ready state this article is concerned with, to the cluster control plane. While monitoring and managing the node, kubelet relies on various plugins to manipulate node resources directly, including network and disk plugins and even the container runtime. The health of these plugins directly affects the status reported by kubelet, and in turn the status of the node itself.

<Follow the public account and reply "Troubleshoot" to download the book>

" Alibaba Cloud Native focuses on microservices, serverless, containers, Service Mesh and other technical fields, focusing on cloud native popular technology trends, cloud native large-scale landing practices, and being the technology circle that understands cloud native developers best."


Source: www.cnblogs.com/alisystemsoftware/p/12722092.html