[kubernetes/k8s source code analysis] The eviction mechanism: principles and source code walkthrough

Copyright notice: this is an original article by the blog author and may not be reproduced without the author's permission. https://blog.csdn.net/zhonglinzhang/article/details/84245188

What?

Why?

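  Drawbacks of letting the kubelet rely on the OOM Killer to reclaim resources:

  • System OOM events are compute-intensive and can stall the node until the OOM kill has completed
  • After the OOM Killer kills containers, the Scheduler may place new Pods on the Node, or the containers may simply be restarted on the Node, which triggers the OOM Killer on that Node again; this cycle can repeat indefinitely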

How?

  Default kubelet eviction flags at startup

--eviction-hard="imagefs.available<15%,memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%"

--eviction-max-pod-grace-period="0"

--eviction-minimum-reclaim=""

--eviction-pressure-transition-period="5m0s"

--eviction-soft=""

--eviction-soft-grace-period=""

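      Note: thresholds come in two kinds, eviction-soft and eviction-hard. When a soft threshold is reached, pods are given a grace period in which to exit gracefully; when a hard threshold is reached, pods are killed immediately with no chance of a graceful exit.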
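
To make the flag format concrete, the following is a minimal sketch of how a single `signal<value` entry from --eviction-hard could be split and evaluated. The `Threshold` type and the `parseThreshold` and `crossed` helpers are hypothetical, simplified names for illustration, not the kubelet's own parser (eviction.ParseThresholdConfig, shown later); the sketch only understands `%` and `Mi` values.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Threshold is a simplified, hypothetical stand-in for an eviction threshold.
type Threshold struct {
	Signal     string
	Percentage float64 // fraction of capacity, set when the value ends in '%'
	Bytes      int64   // absolute quantity, set otherwise
}

// parseThreshold parses one "signal<value" expression.
// Only the "%" and "Mi" suffixes are handled in this sketch.
func parseThreshold(expr string) (Threshold, error) {
	parts := strings.SplitN(expr, "<", 2)
	if len(parts) != 2 {
		return Threshold{}, fmt.Errorf("invalid threshold %q", expr)
	}
	th := Threshold{Signal: parts[0]}
	value := parts[1]
	switch {
	case strings.HasSuffix(value, "%"):
		p, err := strconv.ParseFloat(strings.TrimSuffix(value, "%"), 64)
		if err != nil {
			return Threshold{}, err
		}
		th.Percentage = p / 100
	case strings.HasSuffix(value, "Mi"):
		n, err := strconv.ParseInt(strings.TrimSuffix(value, "Mi"), 10, 64)
		if err != nil {
			return Threshold{}, err
		}
		th.Bytes = n * 1024 * 1024
	default:
		return Threshold{}, fmt.Errorf("unsupported quantity %q", value)
	}
	return th, nil
}

// crossed reports whether the available amount has dropped below the threshold.
func (th Threshold) crossed(available, capacity int64) bool {
	limit := th.Bytes
	if th.Percentage > 0 {
		limit = int64(th.Percentage * float64(capacity))
	}
	return available < limit
}

func main() {
	flag := "memory.available<100Mi,nodefs.available<10%"
	for _, expr := range strings.Split(flag, ",") {
		th, err := parseThreshold(expr)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		// Assume 80Mi available out of 1000Mi for both signals.
		fmt.Println(th.Signal, "crossed:", th.crossed(80<<20, 1000<<20))
	}
}
```

With a hard threshold, a `true` result would lead to eviction right away; with a soft threshold, it would only start the grace-period clock.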

  eviction signals

  • memory.available
  • nodefs.available
  • nodefs.inodesFree
  • imagefs.available
  • imagefs.inodesFree
  • allocatableMemory.available

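Note:

  • nodefs: the node's own filesystem, which holds runtime logs and similar data
  • imagefs: the filesystem dockerd uses to store images and container writable layers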

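The managerImpl struct

  • killPodFunc: set to the killPodNow method
  • imageGC: when disk pressure occurs, imageGC deletes unused images
  • thresholdsFirstObservedAt: records when each threshold was first observed
  • signalToRankFunc: defines, for each signal, the ranking method used to pick pods for eviction
  • nodeConditionsLastObservedAt: records when each node condition was last observed as a result of a threshold being met
  • notifierInitialized: a bool indicating whether the threshold notifier has been initialized, which determines whether the kernel memcg notification feature can be used to make eviction respond faster. It is false when the manager is created; whether kernel memcg notification is used depends entirely on the kubelet's --experimental-kernel-memcg-notification flag. This field does not appear in the struct listing below, which carries thresholdNotifiers and thresholdsLastUpdated instead.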
// managerImpl implements Manager
type managerImpl struct {
	//  used to track time
	clock clock.Clock
	// config is how the manager is configured
	config Config
	// the function to invoke to kill a pod
	killPodFunc KillPodFunc
	// the interface that knows how to do image gc
	imageGC ImageGC
	// the interface that knows how to do container gc
	containerGC ContainerGC
	// protects access to internal state
	sync.RWMutex
	// node conditions are the set of conditions present
	nodeConditions []v1.NodeConditionType
	// captures when a node condition was last observed based on a threshold being met
	nodeConditionsLastObservedAt nodeConditionsObservedAt
	// nodeRef is a reference to the node
	nodeRef *v1.ObjectReference
	// used to record events about the node
	recorder record.EventRecorder
	// used to measure usage stats on system
	summaryProvider stats.SummaryProvider
	// records when a threshold was first observed
	thresholdsFirstObservedAt thresholdsObservedAt
	// records the set of thresholds that have been met (including graceperiod) but not yet resolved
	thresholdsMet []evictionapi.Threshold
	// signalToRankFunc maps a resource to ranking function for that resource.
	signalToRankFunc map[evictionapi.Signal]rankFunc
	// signalToNodeReclaimFuncs maps a resource to an ordered list of functions that know how to reclaim that resource.
	signalToNodeReclaimFuncs map[evictionapi.Signal]nodeReclaimFuncs
	// last observations from synchronize
	lastObservations signalObservations
	// dedicatedImageFs indicates if imagefs is on a separate device from the rootfs
	dedicatedImageFs *bool
	// thresholdNotifiers is a list of memory threshold notifiers which each notify for a memory eviction threshold
	thresholdNotifiers []ThresholdNotifier
	// thresholdsLastUpdated is the last time the thresholdNotifiers were updated.
	thresholdsLastUpdated time.Time
}

1. eviction manager initialization

  Path: pkg/kubelet/kubelet.go

  1.1 eviction configuration parameters

     See the default kubelet eviction flags listed above

	thresholds, err := eviction.ParseThresholdConfig(enforceNodeAllocatable, kubeCfg.EvictionHard, kubeCfg.EvictionSoft, kubeCfg.EvictionSoftGracePeriod, kubeCfg.EvictionMinimumReclaim)
	if err != nil {
		return nil, err
	}
	evictionConfig := eviction.Config{
		PressureTransitionPeriod: kubeCfg.EvictionPressureTransitionPeriod.Duration,
		MaxPodGracePeriodSeconds: int64(kubeCfg.EvictionMaxPodGracePeriod),
		Thresholds:               thresholds,
		KernelMemcgNotification:  experimentalKernelMemcgNotification,
		PodCgroupRoot:            kubeDeps.ContainerManager.GetPodCgroupRoot(),
	}

  1.2 Initializing the eviction manager

	// setup eviction manager
	evictionManager, evictionAdmitHandler := eviction.NewManager(klet.resourceAnalyzer, evictionConfig, killPodNow(klet.podWorkers, kubeDeps.Recorder), klet.imageManager, klet.containerGC, kubeDeps.Recorder, nodeRef, klet.clock)

	klet.evictionManager = evictionManager
	klet.admitHandlers.AddPodAdmitHandler(evictionAdmitHandler)

  1.3 Running the eviction manager

     The call is buried quite deep in the call chain:

  •      Run(updates <-chan kubetypes.PodUpdate)  -> 
  •      fastStatusUpdateOnce()  ->
  •      updateRuntimeUp()  ->
  •      initializeRuntimeDependentModules()  -> 
  •      kl.evictionManager.Start(kl.StatsProvider, kl.GetActivePods, kl.podResourcesAreReclaimed, evictionMonitoringPeriod)

2. The Start function

  Path: pkg/kubelet/eviction/eviction_manager.go

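3. The synchronize function

  3.1 The buildSignalToRankFunc and buildSignalToNodeReclaimFuncs functions

  • buildSignalToRankFunc registers the pod ranking function for each signal
  • buildSignalToNodeReclaimFuncs registers the node-level reclaim functions for each signal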
	// build the ranking functions (if not yet known)
	// TODO: have a function in cadvisor that lets us know if global housekeeping has completed
	if m.dedicatedImageFs == nil {
		hasImageFs, ok := diskInfoProvider.HasDedicatedImageFs()
		if ok != nil {
			return nil
		}
		// author's local debug log (not part of upstream); logs the value rather than its address
		glog.Infof("zzlin managerImpl synchronize m.dedicatedImageFs == nil, hasImageFs: %v", hasImageFs)
		m.dedicatedImageFs = &hasImageFs
		m.signalToRankFunc = buildSignalToRankFunc(hasImageFs)
		m.signalToNodeReclaimFuncs = buildSignalToNodeReclaimFuncs(m.imageGC, m.containerGC, hasImageFs)
	}
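
As a rough illustration of what a signal-to-rank-function map provides, the sketch below registers one ranking function under the memory.available signal and uses it to order eviction candidates. The `pod` type, `rankByMemoryOverRequest` and `buildRankFuncs` are hypothetical, and ranking purely by usage above request is a simplifying assumption; the hasImageFs distinction seen in the excerpt above is omitted.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a simplified, hypothetical view of a pod's memory accounting.
type pod struct {
	Name          string
	MemoryUsage   int64
	MemoryRequest int64
}

// rankFunc orders pods in place so that the first pod is evicted first.
type rankFunc func(pods []pod)

// rankByMemoryOverRequest puts the pods that exceed their memory request by
// the largest amount at the front.
func rankByMemoryOverRequest(pods []pod) {
	sort.SliceStable(pods, func(i, j int) bool {
		overI := pods[i].MemoryUsage - pods[i].MemoryRequest
		overJ := pods[j].MemoryUsage - pods[j].MemoryRequest
		return overI > overJ
	})
}

// buildRankFuncs maps each signal name to the ranking function used for it.
func buildRankFuncs() map[string]rankFunc {
	return map[string]rankFunc{
		"memory.available": rankByMemoryOverRequest,
	}
}

func main() {
	pods := []pod{
		{Name: "within-request", MemoryUsage: 100, MemoryRequest: 200},
		{Name: "far-over", MemoryUsage: 900, MemoryRequest: 100},
		{Name: "slightly-over", MemoryUsage: 300, MemoryRequest: 250},
	}
	if rank, ok := buildRankFuncs()["memory.available"]; ok {
		rank(pods)
	}
	for _, p := range pods {
		fmt.Println(p.Name) // far-over, slightly-over, within-request
	}
}
```

Keeping the ranking behind a map lookup means the synchronize loop only needs to know which signal is under pressure, not how candidates for that signal are compared.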

  3.2 The Get function retrieves node and pod information

  Path: pkg/kubelet/server/stats/summary.go

	activePods := podFunc()
	updateStats := true
	summary, err := m.summaryProvider.Get(updateStats)
	if err != nil {
		glog.Errorf("eviction manager: failed to get get summary stats: %v", err)
		return nil
	}

  3.3 The makeSignalObservations function

    It reports the current state of each signal's resource, covering the following signals:

  • imagefs.inodesFree
  • pid.available
  • memory.available
  • allocatableMemory.available
  • nodefs.available
  • nodefs.inodesFree
  • imagefs.available
	// make observations and get a function to derive pod usage stats relative to those observations.
	observations, statsFunc := makeSignalObservations(summary)
	debugLogObservations("observations", observations)
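
The step from a stats summary to per-signal observations can be sketched as follows. The `nodeSummary` and `observation` types and the `makeObservations` function are hypothetical, trimmed-down stand-ins, and only two of the signals listed above are covered. Treating memory capacity as available plus working set is an assumption of this sketch.

```go
package main

import "fmt"

// nodeSummary is a hypothetical, trimmed-down node stats summary.
type nodeSummary struct {
	MemoryAvailableBytes  uint64
	MemoryWorkingSetBytes uint64
	NodeFsAvailableBytes  uint64
	NodeFsCapacityBytes   uint64
}

// observation records what was seen for one signal.
type observation struct {
	Available uint64
	Capacity  uint64
}

// makeObservations derives per-signal observations from the summary.
// Memory capacity is taken as available + working set in this sketch.
func makeObservations(s nodeSummary) map[string]observation {
	return map[string]observation{
		"memory.available": {
			Available: s.MemoryAvailableBytes,
			Capacity:  s.MemoryAvailableBytes + s.MemoryWorkingSetBytes,
		},
		"nodefs.available": {
			Available: s.NodeFsAvailableBytes,
			Capacity:  s.NodeFsCapacityBytes,
		},
	}
}

func main() {
	obs := makeObservations(nodeSummary{
		MemoryAvailableBytes:  512 << 20,
		MemoryWorkingSetBytes: 1536 << 20,
		NodeFsAvailableBytes:  20 << 30,
		NodeFsCapacityBytes:   100 << 30,
	})
	for _, signal := range []string{"memory.available", "nodefs.available"} {
		o := obs[signal]
		fmt.Printf("%s: %d of %d bytes available\n", signal, o.Available, o.Capacity)
	}
}
```

Carrying the capacity next to the available amount is what allows percentage thresholds such as nodefs.available<10% to be evaluated against an observation.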
