Source code analysis of kubelet's eviction mechanism

1 Overview:

1.1 Source code environment

The version information is as follows:
a. kubernetes cluster: v1.15.0

1.2 Two lines of defense to maintain node stability

On a Linux system, CPU time, memory, disk capacity, inodes, PIDs, and so on are all system resources.

A classification of system resources:
a, compressible resources (CPU)
b, incompressible resources (memory, disk capacity and inodes, PIDs)

As the node agent, kubelet naturally needs some mechanism to make sure that server resources are not exhausted. When compressible resources run short, Pods (containers, i.e. processes) are starved, but they are not forcibly terminated by the operating system or by kubelet. When incompressible resources (such as memory) run short, kubelet can reclaim server resources ahead of time (delete unused images, clean up exited containers and failed Pods, kill running Pods), and the Linux OOM KILLER, as the last line of defense, still has the chance to kill user-mode processes directly.

Kubelet killing a running Pod is called eviction. Kubelet deleting unused images and cleaning up exited containers and failed Pods is called reclaiming node-level resources.

Therefore, there are two lines of defense to maintain the stability of the node. The first line of defense is the user process kubelet, and the second line of defense is Linux OOM KILLER.


1.3 kubelet's mechanism to maintain node stability

When incompressible resources (memory, disk capacity and inodes, PIDs) run short, the stability of the node is seriously affected. For this reason kubelet, the Linux user-mode process acting as the k8s cluster's node agent, needs a mechanism to keep the node stable. In essence it achieves this by deleting images and stopping container processes.


1.3.1 Resource threshold, users have the final say

Kubelet therefore needs to periodically obtain an overall picture of the node's resources.
How little of an incompressible resource must be left before it counts as "not enough"? That criterion is specified by the user in the kubelet configuration file.
Once kubelet has the node's resource capacity, the actual usage, and the thresholds configured by the user, it can decide that "the conditions are met" and start deleting images and stopping container processes.


1.3.2 Soft and hard eviction

"The conditions are met", wait for zero seconds to start deleting the image and stop the container process, this is a hard eviction.
"The conditions are met", wait for N (N> 0) seconds to start the operation of deleting the image and stopping the container process, which is soft eviction. The number N can be set by the parameter -eviction-soft-grace-period of kubelet. In addition, for soft eviction, the --eviction-max-pod-grace-period parameter affects the time the eviction manager waits for kubelet to clean up the Pod (the real waiting time is eviction-max-pod-grace-period+ (eviction-max-pod- grace-period/ 2) ).


1.3.3 Block Pod creation

When the node's incompressible resources are "not enough", kubelet must also have a mechanism to block the creation of new Pods; this is related to the Admit(...) method of the eviction manager.

1.3.4 Affect Linux OOM KILLER

Kubelet is the first line of defense for keeping the server stable. If kubelet cannot resolve the pressure, the second line of defense, Linux OOM KILLER, takes over. The link between the first and the second line of defense is oom_score_adj.
Depending on the Pod's QoS class, kubelet configures a different oom_score_adj, and a process's oom_score_adj is one of the factors that decides which user-mode process the Linux OOM KILLER kills.


Other instructions:
1) Command template for viewing the oom_score_adj of a container's process (PID 1) inside a Pod:
kubectl exec demo-qos-pod cat /proc/1/oom_score_adj
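
2) As a rough sketch of how kubelet derives these scores (simplified from what pkg/kubelet/qos does in this version; the helper name and exact clamping below are illustrative, not the real source):

const (
	guaranteedOOMScoreAdj = -998
	besteffortOOMScoreAdj = 1000
)

// containerOOMScoreAdj is a hypothetical helper showing the idea: Guaranteed pods get a
// strongly negative score (rarely killed), BestEffort pods get the maximum score (killed
// first), and Burstable pods fall in between depending on how much of the node's memory
// their request covers.
func containerOOMScoreAdj(qosClass string, memoryRequestBytes, memoryCapacityBytes int64) int64 {
	switch qosClass {
	case "Guaranteed":
		return guaranteedOOMScoreAdj
	case "BestEffort":
		return besteffortOOMScoreAdj
	}
	// Burstable: the larger the memory request relative to node capacity,
	// the lower (safer) the resulting score.
	adj := int64(1000) - (1000*memoryRequestBytes)/memoryCapacityBytes
	if adj < 1000+guaranteedOOMScoreAdj { // stay above the range reserved for Guaranteed pods
		adj = 1000 + guaranteedOOMScoreAdj
	}
	if adj == besteffortOOMScoreAdj { // stay below the BestEffort score
		adj = besteffortOOMScoreAdj - 1
	}
	return adj
}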


1.4 Overview of Linux OOM KILLER

1) Linux answers "yes" to most memory allocation requests so that more and larger programs can run, because memory that has been requested is usually not all used immediately. This technique is called overcommit.

2) When Linux finds that memory is insufficient, the OOM killer (OOM = out-of-memory) kicks in. It then chooses some processes to kill (user-mode processes, not kernel threads) in order to free memory.

3) When the OOM killer runs, which processes does Linux choose to kill? Linux computes a score (0~1000) for each process; the higher the score, the more likely the process is to be killed.
The formula for the OOM score:

process OOM score = oom_score + oom_score_adj

#oom_score is related to the amount of memory the process consumes.
#oom_score_adj is configurable; its range goes from -1000 (lowest) to 1000 (highest).
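
A rough, made-up illustration of the formula: a Burstable container whose oom_score_adj is 900 and whose memory usage gives it an oom_score of around 100 ends up near 1000 and is a prime OOM target, while a Guaranteed container with oom_score_adj -998 and a similar oom_score ends up far below zero and is effectively protected.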

4) The OOM log is written to /var/log/messages and belongs to the kernel log. You can view the kernel log with a command such as grep kernel /var/log/messages.


2 Source code analysis of kubelet eviction:

2.1 Key methods

Kubelet -> Run() -> updateRuntimeUp() -> initializeRuntimeDependentModules   // start kubelet -> initialize runtime-dependent modules, runs only once
    |
    |-> kl.cadvisor.Start()           // start cAdvisor, which provides node and container information
    |
    |-> kl.containerManager.Start()   // containerManager needs the filesystem information provided by cAdvisor
    |
    |-> kl.evictionManager.Start()    // evictionManager needs cAdvisor to know whether the container runtime has a dedicated imagefs
            |
            -> periodic task -> | -> m.synchronize() -> | -> m.reclaimNodeLevelResources(...)  // reclaim node-level resources (images, exited containers)
                                |                       |
                                |                       | -> m.evictPod(...)                   // evict a pod object
                                |
                                | -> m.waitForPodsCleanup(...)  // if m.synchronize() evicted pods, wait here for them to be cleaned up

2.2 Supported signals that can trigger Pod eviction

When memory, disk capacity, disk inodes, or PIDs run short, kubelet can be triggered to clean up images and containers in order to reclaim node resources.

// Signal defines a signal that can trigger eviction of pods on a node.
type Signal string

/*
Categories:
1) memory
2) disk capacity
3) number of disk inodes
4) number of PIDs
*/
const (
	// SignalMemoryAvailable is memory available (i.e. capacity - workingSet), in bytes.
	SignalMemoryAvailable Signal = "memory.available"
	
	// SignalNodeFsAvailable is amount of storage available on filesystem that kubelet uses for volumes, daemon logs, etc.
	// nodefs is the filesystem holding the volumes kubelet uses and the daemon/process logs
	SignalNodeFsAvailable Signal = "nodefs.available"
	
	// SignalNodeFsInodesFree is amount of inodes available on filesystem that kubelet uses for volumes, daemon logs, etc.
	SignalNodeFsInodesFree Signal = "nodefs.inodesFree"
	
	// SignalImageFsAvailable is amount of storage available on filesystem that container runtime uses for storing images and container writable layers.
	// imagefs is the filesystem where the container runtime (e.g. docker) stores image layers and writable layers
	SignalImageFsAvailable Signal = "imagefs.available"
	
	// SignalImageFsInodesFree is amount of inodes available on filesystem that container runtime uses for storing images and container writable layers.
	SignalImageFsInodesFree Signal = "imagefs.inodesFree"
	
	// SignalAllocatableMemoryAvailable is amount of memory available for pod allocation (i.e. allocatable - workingSet (of pods), in bytes.
	SignalAllocatableMemoryAvailable Signal = "allocatableMemory.available"
	
	// SignalPIDAvailable is amount of PID available for pod allocation
	SignalPIDAvailable Signal = "pid.available"
)
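
These signal names are exactly what appears on the left-hand side of the eviction thresholds in the kubelet configuration. For example (illustrative values only):

--eviction-hard=memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%,imagefs.available<15%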

2.3 Overall code

//kubelet's startup entry point
func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
	/*
		other code
	*/
	
	go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop)
	
}


// when the container runtime starts running for the first time, initialize the modules that depend on it
func (kl *Kubelet) updateRuntimeUp() {
	/*
		other code
	*/
	
	kl.oneTimeInitializer.Do(kl.initializeRuntimeDependentModules)

}

//start cAdvisor, containerManager, and the focus of this article, evictionManager
func (kl *Kubelet) initializeRuntimeDependentModules() {

	// start cAdvisor
	if err := kl.cadvisor.Start(); err != nil {		
	}
	
	/*
		other code
	*/
	
	// containerManager must start after cAdvisor, because it needs the filesystem information that cAdvisor provides
	if err := kl.containerManager.Start(node, kl.GetActivePods, kl.sourcesReady, kl.statusManager, kl.runtimeService); err != nil {		
	}
	
	//evictionManager is the focus of this article
	// evictionManager must start after cAdvisor, because it needs to know whether the container runtime has a dedicated imagefs
	kl.evictionManager.Start(kl.StatsProvider, kl.GetActivePods, kl.podResourcesAreReclaimed, evictionMonitoringPeriod)
	
	/*
		other code
	*/
}

The main fields of the eviction manager struct

// managerImpl implements Manager
type managerImpl struct {
	
	//the eviction-related fields of the kubelet configuration file are copied into this field
	config Config
	
	//the function used to kill a Pod
	//the --eviction-max-pod-grace-period parameter is related to this function
	killPodFunc KillPodFunc
	
	
	// the interface that knows how to do image gc
	imageGC ImageGC
	// the interface that knows how to do container gc
	containerGC ContainerGC
	// protects access to internal state
	
	// incompressible resources whose thresholds have been met show up in this field
	thresholdsMet []evictionapi.Threshold
	
	// the functions used to rank Pods are kept in this field
	signalToRankFunc map[evictionapi.Signal]rankFunc
	// signalToNodeReclaimFuncs maps a resource to an ordered list of functions that know how to reclaim that resource.
	
	thresholdNotifiers []ThresholdNotifier

}	

2.4 How to start the eviction manager

func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, podCleanedUpFunc PodCleanedUpFunc, monitoringInterval time.Duration) {
	/*
	other code
	*/
	
	//run m.synchronize() periodically using a for loop plus sleep
	go func() {
		for {
		    //m.synchronize() contains the logic for reclaiming node-level resources and evicting Pods
			if evictedPods := m.synchronize(diskInfoProvider, podFunc); evictedPods != nil {
				m.waitForPodsCleanup(podCleanedUpFunc, evictedPods)
			} else {
			    //sleep
				time.Sleep(monitoringInterval)
			}
		}
	}()
}
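
For reference, the monitoringInterval passed in by kubelet is the evictionMonitoringPeriod constant (10 seconds in this version), so when nothing is evicted the manager re-evaluates roughly every 10 seconds, and when pods were evicted it loops again immediately after waiting for their cleanup.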

2.5 The main logic of the eviction manager when reclaiming resources


func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
		
	/*
		populate m.signalToRankFunc, i.e. the functions used to rank pods; this happens only once over the whole lifetime
	*/
	
	// get the pods running on the node
	activePods := podFunc()
	// get detailed node information from cAdvisor and build statistics from it
	summary, err := m.summaryProvider.Get(true)
	//derive the observed node resource usage (observations) from the statistics
	observations, statsFunc := makeSignalObservations(summary)
	
	// compare actual resource usage with capacity, and obtain the list of threshold objects that are met
	thresholds = thresholdsMet(thresholds, observations, false)	
	thresholdsFirstObservedAt := thresholdsFirstObservedAt(thresholds, m.thresholdsFirstObservedAt, now)
	//at this point thresholds contains the thresholds that are about to trigger reclamation, because their grace periods have already elapsed
	thresholds = thresholdsMetGracePeriod(thresholdsFirstObservedAt, now)

		
	/*
		build nodeConditions;
		nodeConditions is a string slice;
		log some information related to nodeConditions;
	*/
	
	/*
		update some member variables of the eviction manager (the receiver of this method)

		m.thresholdsMet = thresholds

		//once m.nodeConditions has persisted past the time set by --eviction-pressure-transition-period, it is also written into kubelet's local node object, and the node object is eventually synced by kubelet to kube-apiserver
		//m.nodeConditions is also used by m.Admit(...), which decides whether the target pod may be created or run on this node
		m.nodeConditions = nodeConditions

	*/
	
	// eviction caused by local ephemeral storage (it always evicts pods); if it happens, the subsequent reclamation steps (cleaning up node resources and running pods) are skipped and we return immediately
	if utilfeature.DefaultFeatureGate.Enabled(features.LocalStorageCapacityIsolation) {
		if evictedPods := m.localStorageEviction(summary, activePods); len(evictedPods) > 0 {
			return evictedPods
		}
	}
	
	// node resource usage has not reached any user-configured threshold, no node resources need to be reclaimed, so return
	if len(thresholds) == 0 {
		klog.V(3).Infof("eviction manager: no resources are starved")
		return nil
	}
	
	// reclaim node-level resources; if enough was reclaimed, return directly without evicting running pods
	if m.reclaimNodeLevelResources(thresholdToReclaim.Signal, resourceToReclaim) {
		klog.Infof("eviction manager: able to reduce %v pressure without evicting pods.", resourceToReclaim)
		return nil
	}

	// based on the resource type, pick a ranking function from a map and use it to sort the running pods
	rank, ok := m.signalToRankFunc[thresholdToReclaim.Signal]	
	rank(activePods, statsFunc)
	// at this point activePods has been sorted according to the ranking rules
	klog.Infof("eviction manager: pods ranked for eviction: %s", format.Pods(activePods))
	
	// start evicting the running pods
	for i := range activePods {	
		//if a pod was evicted, return immediately, because at most one pod is evicted per cycle
		if m.evictPod(pod, gracePeriodOverride, message, annotations) {
			return []*v1.Pod{pod}
		}
	}
	
	// reaching this point means this cycle tried to evict pods but did not manage to evict even one
	klog.Infof("eviction manager: unable to evict any pods from the node")
	return nil
}
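
To make the thresholdsMet(...) step above more concrete, here is a minimal sketch of how a single threshold comparison works; the type and function names are hypothetical and this is not the real eviction code:

// thresholdValue is a hypothetical simplification of a user-configured eviction threshold:
// it is either an absolute quantity or a percentage of the resource's capacity.
type thresholdValue struct {
	quantity   int64   // absolute value in bytes (e.g. 100Mi), 0 when percentage is used
	percentage float64 // fraction of capacity (e.g. 0.10 for 10%), 0 when quantity is used
}

// thresholdMetSketch reports whether the observed available amount has dropped below the threshold.
func thresholdMetSketch(availableBytes, capacityBytes int64, v thresholdValue) bool {
	limit := v.quantity
	if v.percentage > 0 {
		limit = int64(float64(capacityBytes) * v.percentage)
	}
	return availableBytes < limit
}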

2.6 The signalToRankFunc property of the eviction manager

It holds a set of methods for sorting Pods.

func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
	/*
	other code
	*/
	
	if m.dedicatedImageFs == nil {
		hasImageFs, ok := diskInfoProvider.HasDedicatedImageFs()
		if ok != nil {
			return nil
		}
		//once dedicatedImageFs has been assigned, this whole if block is never entered again
		m.dedicatedImageFs = &hasImageFs
		//hand the pod ranking functions, as a map, to the eviction manager's signalToRankFunc field
		m.signalToRankFunc = buildSignalToRankFunc(hasImageFs)   
	}
	
	/*
	other code
	*/
}
func buildSignalToRankFunc(withImageFs bool) map[evictionapi.Signal]rankFunc {

	//whether the parameter is true or false, the ranking functions for memory and pid stay the same
	signalToRankFunc := map[evictionapi.Signal]rankFunc{
		evictionapi.SignalMemoryAvailable:            rankMemoryPressure,
		evictionapi.SignalAllocatableMemoryAvailable: rankMemoryPressure,
		evictionapi.SignalPIDAvailable:               rankPIDPressure,
	}
	
	//these are all rankDiskPressureFunc, but the arguments passed to it differ.
	if withImageFs {
		
		// with a dedicated imagefs, the ranking for nodefs only covers logs and local volumes
		signalToRankFunc[evictionapi.SignalNodeFsAvailable] = rankDiskPressureFunc([]fsStatsType{fsStatsLogs, fsStatsLocalVolumeSource}, v1.ResourceEphemeralStorage)
		signalToRankFunc[evictionapi.SignalNodeFsInodesFree] = rankDiskPressureFunc([]fsStatsType{fsStatsLogs, fsStatsLocalVolumeSource}, resourceInodes)
			
		// with a dedicated imagefs, the ranking for imagefs only covers rootfs
		signalToRankFunc[evictionapi.SignalImageFsAvailable] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot}, v1.ResourceEphemeralStorage)
		signalToRankFunc[evictionapi.SignalImageFsInodesFree] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot}, resourceInodes)
	} else {	
		// without a dedicated imagefs, the ranking for nodefs covers all filesystems, i.e. fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource
		// in that case imagefs and nodefs share the same block device, so their ranking functions are identical.
		signalToRankFunc[evictionapi.SignalNodeFsAvailable] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}, v1.ResourceEphemeralStorage)
		signalToRankFunc[evictionapi.SignalNodeFsInodesFree] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}, resourceInodes)
		signalToRankFunc[evictionapi.SignalImageFsAvailable] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}, v1.ResourceEphemeralStorage)
		signalToRankFunc[evictionapi.SignalImageFsInodesFree] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}, resourceInodes)
	}
	return signalToRankFunc
}

2.7 How Pods are ranked for eviction

When the eviction manager evicts Pods, it ranks the running Pods; the Pods at the front of the queue are cleaned up first.

Pods whose QoS is Guaranteed end up at the back of the queue: during normal operation, usage minus request is negative, so they naturally sort toward the back. When such a Pod's resource usage reaches its request, it has also reached its limit; Linux then detects that the limit value set in the cgroup file has been hit and sends a kill signal to the process in the Pod, so at that point it is no longer a running Pod.

Pods whose QoS is Best-Effort end up at the front: no request is set, so usage minus request is always positive, which puts them in the front part of the queue.
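
A rough worked example under memory pressure (numbers invented): suppose a Best-Effort pod is using 200Mi (no request, so it exceeds its request by 200Mi), a Burstable pod is using 600Mi against a 500Mi request (exceeds by 100Mi), and a Guaranteed pod is using 100Mi against a 128Mi request (28Mi under). The first two exceed their requests, so they rank ahead of the Guaranteed pod, and the Best-Effort pod's larger usage-minus-request gap pushes it to the very front, making it the first eviction candidate.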


2.7.1 How Pods are ranked when memory is tight

/*
When memory usage reaches a threshold, rank the incoming pods:
first by whether usage exceeds the pod's request,
then by the pod's priority,
finally by the gap between usage and request.
*/

func rankMemoryPressure(pods []*v1.Pod, stats statsFunc) {

	orderedBy(
		exceedMemoryRequests(stats), //does usage exceed the pod's memory request?
		priority,                    //the pod's priority
		memory(stats),               //the gap between usage and request
	).Sort(pods)

}

2.7.2 How Pods are ranked when PIDs are tight

/*
When PID usage reaches a threshold, rank the incoming pods by the pod's priority.
*/
func rankPIDPressure(pods []*v1.Pod, stats statsFunc) {
	orderedBy(priority).Sort(pods)
}

2.7.3 How Pods are ranked when disk is tight

/*
When disk usage reaches a threshold, rank the incoming pods:
first by whether usage exceeds the pod's request,
then by the pod's priority,
finally by the gap between usage and request.
*/
func rankDiskPressureFunc(fsStatsToMeasure []fsStatsType, diskResource v1.ResourceName) rankFunc {
	return func(pods []*v1.Pod, stats statsFunc) {

		orderedBy(
			exceedDiskRequests(stats, fsStatsToMeasure, diskResource), //does usage exceed the pod's disk request?
			priority,                                                  //the pod's priority
			disk(stats, fsStatsToMeasure, diskResource),               //the gap between usage and request
		).Sort(pods)

	}
}

2.8 Other methods

2.8.1 func (m *managerImpl) Admit(…)

The Admit(...) method of the eviction manager will affect whether the Pod is allowed to be created.

/*
1) the attrs argument contains a pod object
2) through this method, the m.nodeConditions field ties into other kubelet flows:
  2.1) HandlePodAdditions(...) calls this method indirectly, so that the target pod is not created.
  2.2) syncPod(...) calls this method indirectly, so that the target pod (if it exists) gets deleted.
*/

func (m *managerImpl) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
	m.RLock()
	defer m.RUnlock()
	
	//no resource pressure on the node: any pod is admitted
	if len(m.nodeConditions) == 0 {
		return lifecycle.PodAdmitResult{Admit: true}
	}
	
	//critical pods are also admitted
	if kubelettypes.IsCriticalPod(attrs.Pod) {
		return lifecycle.PodAdmitResult{Admit: true}
	}
	
	if hasNodeCondition(m.nodeConditions, v1.NodeMemoryPressure) {
		notBestEffort := v1.PodQOSBestEffort != v1qos.GetPodQOS(attrs.Pod)
		
		// the node is under memory pressure: best-effort pods are not admitted, pods of any other class are
		if notBestEffort {
			return lifecycle.PodAdmitResult{Admit: true}
		}

		// the node is under memory pressure and the TaintNodesByCondition feature is enabled: a pod that tolerates the memory-pressure taint is also admitted
		if utilfeature.DefaultFeatureGate.Enabled(features.TaintNodesByCondition) &&
			v1helper.TolerationsTolerateTaint(attrs.Pod.Spec.Tolerations, &v1.Taint{
				Key:    schedulerapi.TaintNodeMemoryPressure,
				Effect: v1.TaintEffectNoSchedule,
			}) {
			return lifecycle.PodAdmitResult{Admit: true}
		}
	}

	// reaching this point means the node is under disk pressure, or it is under memory pressure and the incoming pod is best-effort; in these cases the pod is not admitted
	klog.Warningf("Failed to admit pod %s - node has conditions: %v", format.Pod(attrs.Pod), m.nodeConditions)
	return lifecycle.PodAdmitResult{
		Admit:   false,
		Reason:  Reason,
		Message: fmt.Sprintf(nodeConditionMessageFmt, m.nodeConditions),
	}
}

3 What happens after a Pod is evicted

3.1 After Pods managed by the ds and sts controllers are evicted, the controllers do not create new Pods

An evicted Pod still occupies a record, and the main loop logic of the sts and ds controllers will not create a replacement Pod.
It is therefore a good idea to run a crontab task on the master node (kubectl delete pod --all-namespaces --field-selector='status.phase==Failed') that periodically deletes failed Pods (including evicted Pods).
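
A minimal sketch of such a crontab entry, reusing the command above (the hourly schedule is just an example):

0 * * * * kubectl delete pod --all-namespaces --field-selector='status.phase==Failed'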

After the two ds Pods in the screenshot were evicted, their controllers did not create new Pods, which broke container networking on the node; in other words, the evicted ds and sts services did not heal on their own.
The coredns Pod in the screenshot is managed by a deployment controller; after the old Pod was evicted the controller created a new one, so the DNS service inside the cluster healed itself.


4 Summary

Node resources can be divided into compressible resources and incompressible resources. When incompressible resources become scarce, the node becomes unstable.
As the node agent, the user-mode process kubelet is the first line of defense for node stability. It periodically checks the node's resource status, the resources actually used by Pods, and the thresholds configured by the user, and then cleans up images and containers to keep the node stable; after all, the stability of the node matters more than the stability of any single container.
The second line of defense, and also the last one, is Linux OOM KILLER.
The two lines of defense cooperate through oom_score_adj: when kubelet creates a container, it sets oom_score_adj to influence how the Linux OOM KILLER picks processes to kill.


Origin blog.csdn.net/nangonghen/article/details/109696462