Lessons learned about Kubernetes garbage collection

Some time ago, I learned an important lesson about Kubernetes. The story starts with Kubernetes Operators, a way to package, deploy, and manage Kubernetes applications. The mistake I made involves garbage collection in the cluster, which cleans up objects that no longer have an owner object (more on that later).

The task

Last year, my team was assigned to develop a Kubernetes Operator. For most people on the team, it was their first experience with the Operator SDK (software development kit) and Kubernetes controllers (the Kubernetes control loop).

We read some introductory material about the Operator SDK, followed the quick start guide for building an Operator in the Go programming language, and learned some basic principles and simple techniques along the way.

Our task was to develop an Operator that could install, configure, and ensure production readiness for several projects. The goal was to automate the management of a set of instances so that only minimal manual operations would be needed from the Site Reliability Engineering (SRE) team. It was not an easy task.

The bug

Initially, we pursued a proof-of-concept implementation, so we recorded some bugs and planned to fix them later.

One of those bugs was not urgent but important, in fact very important: namespaces created by the Operator would sometimes terminate without any request from a user. It did not happen often, so we decided to deal with it later.

Eventually, I picked this bug up from the backlog and started looking for the root cause. The Operator itself could not have been terminating the namespaces, because there was no Delete API call anywhere in the code at the time. In hindsight, that was the first clue. I made sure the Kubernetes API server's logs were being stored safely, and then waited for the problem to happen again.

After the problem occurred again in the environment in question, I searched the logs for entries containing both of the following strings: "requestURI": "/api/v1/namespaces/my-namespace" and "verb": "delete".
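
As a sketch of that kind of search, here is a small Go filter over an audit log written as JSON lines; the file path and the namespace name are placeholders, not our actual setup:

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// auditEvent captures only the audit-log fields we filter on.
type auditEvent struct {
	Verb       string `json:"verb"`
	RequestURI string `json:"requestURI"`
	User       struct {
		Username string `json:"username"`
	} `json:"user"`
}

func main() {
	// Hypothetical path; point this at wherever your audit log is written.
	f, err := os.Open("/var/log/kubernetes/audit.log")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		var ev auditEvent
		if err := json.Unmarshal(scanner.Bytes(), &ev); err != nil {
			continue // skip lines that are not valid JSON events
		}
		// Keep only delete calls against the namespace we care about.
		if ev.Verb == "delete" && strings.HasPrefix(ev.RequestURI, "/api/v1/namespaces/my-namespace") {
			fmt.Printf("deleted by %s: %s\n", ev.User.Username, ev.RequestURI)
		}
	}
}
```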

From the search results, I found the identity that was performing the namespace deletion:

"user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector"

Now I knew how the namespaces were being deleted, but I didn't know why. I opened the Kubernetes garbage collection documentation, read the basics about the ownerReference field, and thought about why this was happening.

We set the ownerReference metadata field on the namespaces we created. The owner was our own resource, defined by our custom resource API. When our custom resource was deleted, the namespaces it owned via ownerReference were deleted with it. Having the associated objects cleaned up automatically made the uninstall step a breeze.
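
To make the pattern concrete, here is a minimal sketch of a namespace built with such an owner reference, using client-go and apimachinery types; the custom resource group, kind, and names are hypothetical, not the actual types from our Operator:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// namespaceOwnedBy builds a Namespace whose metadata carries an ownerReference
// pointing at a (hypothetical) namespace-scoped custom resource.
func namespaceOwnedBy(name, ownerName string, ownerUID types.UID) *corev1.Namespace {
	controller := true
	return &corev1.Namespace{
		ObjectMeta: metav1.ObjectMeta{
			Name: name,
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "example.com/v1alpha1", // hypothetical CRD group/version
				Kind:       "MyApp",                // hypothetical namespace-scoped kind
				Name:       ownerName,
				UID:        ownerUID,
				Controller: &controller,
			}},
		},
	}
}

func main() {
	ns := namespaceOwnedBy("my-namespace", "my-app-instance", types.UID("1234-abcd"))
	fmt.Printf("%+v\n", ns.ObjectMeta.OwnerReferences)
}
```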

I didn't think this was the problem, so I kept reading the logs for more clues. I noticed that the namespaces were deleted whenever the kube-controller-manager pod restarted. The reason for the restarts also made sense: the kube-controller-manager pod runs on the master node, our development cluster had only one master node, and for the instance size we were using, the load on that node was very high.

So I tried to reproduce the problem myself. I deleted the kube-controller-manager pod, a new one came up, and I checked its logs. When I saw some entries about garbage collection, I finally wanted to understand it and went back to the garbage collection documentation. Sure enough: "Cross-namespace owner references are disallowed by design. This means: 1) Namespace-scoped dependents can only specify owners in the same namespace, and owners that are cluster-scoped. 2) Cluster-scoped dependents can only specify cluster-scoped owners, but not namespace-scoped owners."

Our custom resource is namespace-scoped, while namespaces themselves are cluster-scoped. Interestingly, the Kubernetes API server created the namespace even though the owner reference we used was disallowed. So the namespace was created with an invalid owner reference, and the garbage collector eventually deleted it.

Lessons learned              

The technical lesson I learned is simple: don't use owner references in which a namespace-scoped resource owns a cluster-scoped resource or a resource in another namespace. When you use these "disallowed by design" owner references, the garbage collection routine will delete your Kubernetes resources as soon as the kube-controller-manager pod starts up again.
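
For completeness, one way to keep the automatic cleanup without a disallowed owner reference (a sketch, not necessarily what we ended up doing) is to delete the namespace explicitly from the reconciler when the custom resource is being deleted, typically guarded by a finalizer:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// cleanupNamespace deletes the namespace the Operator created for a custom
// resource. It is meant to be called from the reconciler when the custom
// resource has a deletion timestamp, before its finalizer is removed. The
// namespace name is assumed to be derivable from the custom resource.
func cleanupNamespace(ctx context.Context, c client.Client, nsName string) error {
	ns := &corev1.Namespace{
		ObjectMeta: metav1.ObjectMeta{Name: nsName},
	}
	if err := c.Delete(ctx, ns); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	return nil
}
```

A finalizer on the custom resource ensures the reconciler gets a chance to run this cleanup before the resource itself disappears.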

The more important lesson is not to underestimate the documentation. If I had been more patient the first time I read it, I would certainly have saved some time.

You might think this situation could have been prevented if there had been a warning when the invalid owner reference was added to the code base, but there wasn't: the note about this behavior was only added to the documentation by a pull request in February 2019, which means there is always room for improvement in documentation.

Original link:

https://opensource.com/article/20/6/kubernetes-garbage-collection

 
