The K8s Scheduler (kube-scheduler)

Author: Alibaba Cloud Yunqi Community
Link: https://zhuanlan.zhihu.com/p/101908480
Source: Zhihu
Copyright belongs to the author. For commercial reprints please contact the author for authorization; for non-commercial reprints please credit the source.

Overview: Kubernetes is currently the most popular platform for automated container operations. To enable flexible, declarative container orchestration, this article, based on release-1.16, walks through the basic framework and workflow of K8s scheduling and its main Filter and Score algorithms, and describes two ways to implement custom scheduling capabilities.

Scheduling Process

Scheduling Process Overview

Kubernetes is the most mainstream container orchestration and operations platform today, and kube-scheduler, the core K8s component responsible for container scheduling, is the protagonist of this article. The discussion below is based on release-1.16. The figure below shows the main components kube-scheduler works with:

(Figure: the main components of kube-scheduler)
Policy

The scheduler's scheduling policy currently supports three startup configuration methods: a configuration file, command-line flags, or a ConfigMap. The scheduling policy can specify which filters (Predicates), scorers (Priorities), and external extenders (Extenders) the main scheduling flow should use, as well as the custom extension points (Plugins) supported by the recently added Scheduler Framework.

Informer

At startup the scheduler uses the K8s informer mechanism (List + Watch) to obtain the data it needs for scheduling from kube-apiserver, for example Pods, Nodes, PersistentVolumes (PV), and PersistentVolumeClaims (PVC), and preprocesses that data into the scheduler Cache.

Schedule Pipeline

Pods waiting to be scheduled are inserted into a Queue via the Informer; the Schedule Pipeline loop pops the next pending Pod from the Queue and runs it through the pipeline.

The Schedule Pipeline has three main phases: Scheduler Thread, Wait Thread, and Bind Thread.

  • Scheduler Thread phase: as the architecture diagram above shows, the Scheduler Thread goes through Pre Filter -> Filter -> Post Filter -> Score -> Reserve, which can be simplified to Filter -> Score -> Reserve.

The Filter phase selects Nodes that satisfy the Pod's Spec; the Score phase scores and sorts the Nodes that passed the Filter; and the Reserve phase reserves the Pod in the NodeCache of the best-scoring Node ("book-keeping"), marking the Pod as already assigned to that Node, so that when the next Pod is filtered and scored against that Node it can see the Pod that was just placed there.

  • Wait Thread phase: this phase waits for resources associated with the Pod to become Ready, for example waiting for a PVC's PV to be created successfully, or waiting for the other Pods in a Gang-scheduling group to be scheduled successfully;
  • Bind Thread phase: persists the association between the Pod and the Node to kube-apiserver.

In the whole pipeline only the Scheduler Thread phase schedules Pods serially, one Pod at a time; the Wait and Bind phases are executed asynchronously and in parallel for each Pod.
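To make this threading model concrete, here is a minimal, self-contained sketch (not the actual kube-scheduler code; Pod, Node, and the helper functions are simplified placeholders): the scheduling loop runs serially, while waiting and binding run in their own goroutine per Pod.

package main

import (
	"fmt"
	"time"
)

// Pod and Node are simplified placeholders, not the real API types.
type Pod struct{ Name string }
type Node struct{ Name string }

// schedulePipeline sketches the Schedule Pipeline: the Scheduler Thread
// (Filter -> Score -> Reserve) runs serially, one Pod at a time; the Wait
// and Bind phases run asynchronously per Pod.
func schedulePipeline(queue <-chan Pod) {
	for pod := range queue {
		node := filterScoreReserve(pod) // serial: Scheduler Thread
		go func(p Pod, n Node) {        // async: Wait Thread + Bind Thread
			if waitForReady(p) {
				bind(p, n)
			}
		}(pod, node)
	}
}

func filterScoreReserve(p Pod) Node { return Node{Name: "node-1"} }
func waitForReady(p Pod) bool       { time.Sleep(10 * time.Millisecond); return true }
func bind(p Pod, n Node)            { fmt.Printf("bound %s to %s\n", p.Name, n.Name) }

func main() {
	q := make(chan Pod, 2)
	q <- Pod{Name: "pod-a"}
	q <- Pod{Name: "pod-b"}
	close(q)
	schedulePipeline(q)
	time.Sleep(100 * time.Millisecond) // let the async bind goroutines finish
}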

Detailed scheduling process

Having introduced the main components and how they relate to each other, let's take a deeper look at how the Schedule Pipeline actually works. The figure below is a detailed kube-scheduler flowchart; we start by explaining the scheduling queue:

(Figure: detailed kube-scheduler scheduling flow)
The SchedulingQueue has three sub-queues: activeQ, backoffQ, and unschedulableQ.

When the scheduler starts, all Pods waiting to be scheduled enter activeQ, which is sorted by Pod priority. The Schedule Pipeline takes one Pod from activeQ at a time and runs the scheduling process for it. When scheduling fails, the Pod is placed into either unschedulableQ or backoffQ, depending on the situation: if the scheduler Cache (Node Cache, Pod Cache, and so on) changed while the current Pod was being scheduled, the Pod enters backoffQ; otherwise it enters unschedulableQ.

unschedulableQ is flushed into activeQ or backoffQ periodically over a relatively long interval (for example every 60 seconds), or whenever a change in the Scheduler Cache related to the Pod is detected. backoffQ uses a backoff mechanism so that Pods that are likely to become schedulable again get back into activeQ and are rescheduled faster than they would be from unschedulableQ.

Looking at the Scheduler Thread phase in more detail: when the Schedule Pipeline obtains a pending Pod, it runs the Filter matching logic against Nodes fetched from the NodeCache. The way Nodes are traversed out of the NodeCache is a space optimization that can be summarized as sampled scheduling that avoids scanning every node while still taking disaster tolerance into account.

The optimization works as follows (interested readers can look at the Next method in node_tree.go): in the NodeCache, Nodes are grouped by zone. During the Filter phase, the NodeCache maintains a zoneIndex; each time a Node is popped out for filtering, the zoneIndex moves forward one position and a Node is taken from that zone's node list.

Each zone also keeps a nodeIndex, which is incremented every time a Node is taken from that zone. If the current zone has no more Nodes, the next zone is used. Overall the zoneIndex moves from left to right and the nodeIndex from top to bottom, which guarantees that the Nodes handed to the Filter are spread across zones, so that sampling avoids scanning every node while still keeping the deployment balanced across AZs. (In the latest release-1.17 this algorithm has been removed, because it ignores the Pod's and Node's own preferences and there is no way for a Pod's Spec to require this behavior.)
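A simplified, self-contained sketch of this round-robin traversal (not the real node_tree.go; the zone and node names are made up for illustration):

package main

import "fmt"

// nodeTree is a simplified model of the scheduler's NodeCache zone grouping.
type nodeTree struct {
	zones     []string            // ordered zone names
	nodes     map[string][]string // zone -> node names
	zoneIndex int                 // which zone to take the next node from
	nodeIndex map[string]int      // per-zone cursor
}

// next returns nodes round-robin across zones, so consecutive candidates
// handed to the Filter phase come from different zones.
func (t *nodeTree) next() (string, bool) {
	for tried := 0; tried < len(t.zones); tried++ {
		zone := t.zones[t.zoneIndex]
		t.zoneIndex = (t.zoneIndex + 1) % len(t.zones)
		i := t.nodeIndex[zone]
		if i < len(t.nodes[zone]) {
			t.nodeIndex[zone] = i + 1
			return t.nodes[zone][i], true
		}
	}
	return "", false // all zones exhausted
}

func main() {
	t := &nodeTree{
		zones: []string{"zone1", "zone2", "zone3"},
		nodes: map[string][]string{
			"zone1": {"n1", "n2"}, "zone2": {"n3"}, "zone3": {"n4", "n5"},
		},
		nodeIndex: map[string]int{},
	}
	for n, ok := t.next(); ok; n, ok = t.next() {
		fmt.Println(n) // n1 n3 n4 n2 n5 — candidates are spread across zones
	}
}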

A brief note on sampled scheduling and the sample size. The default sampling ratio is given by: sampling ratio (%) = Max(5, 50 - number of cluster nodes / 125), and the sample size = Max(100, number of cluster nodes * sampling ratio).

For example: with a cluster of 3,000 nodes, the sampling ratio = Max(5, 50 - 3000/125) = 26%, and the sample size = Max(100, 3000 * 0.26) = 780. In the scheduling pipeline, as soon as the Filter has matched 780 candidate nodes, the Filter process can stop and move on to the Score phase.
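A small sketch of this calculation (the constants 5, 50, 125, and 100 are simply the defaults quoted above):

package main

import "fmt"

// sampleSize returns how many feasible nodes the Filter phase looks for
// before stopping, following the formulas above:
//   ratio (%) = max(5, 50 - numNodes/125)
//   size      = max(100, numNodes * ratio / 100)
func sampleSize(numNodes int) int {
	ratio := 50 - numNodes/125
	if ratio < 5 {
		ratio = 5
	}
	size := numNodes * ratio / 100
	if size < 100 {
		size = 100
	}
	return size
}

func main() {
	fmt.Println(sampleSize(3000)) // 3000 * 26% = 780
}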

In the Score phase, the nodes that passed the Filter are scored by the scoring plugins configured in the Policy and sorted, and the node with the highest score is chosen as the SelectHost. The Pod is then assigned to that Node; this step, called the Reserve phase, can be described as "book-keeping" (pre-occupying resources). During this reservation the Pod's status in the PodCache is changed to Assumed (a state that exists only in memory).

Scheduling also involves the Pod's lifecycle state machine. Briefly, the main Pod states here are: Initial (a virtual state) -> Assumed (Reserved) -> Added -> Deleted (a virtual state). When the data watched through the Informer confirms that the Pod has been assigned to a node, the Pod's state becomes Added. The selected node is bound in the Bind phase, and the Bind may fail; when it does, a rollback is performed: the pre-occupation recorded in the book-keeping is returned, the Pod's state is rolled back from Assumed to Initial, and the Pod is erased from the Node's book-keeping.

If the Bind fails, the Pod is thrown back into the unschedulableQ queue. In the scheduling queues, under what circumstances does a Pod end up in backoffQ instead? This is a subtle point: if the Cache changed during that scheduling cycle, the Pod is put into backoffQ. The wait time in backoffQ is shorter than in unschedulableQ, and backoffQ follows an exponential (power of 2) backoff policy: assume the first retry waits 1s, then the second waits 2s, the third 4s, the fourth 8s, up to a maximum of 10s.
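A tiny self-contained sketch of the two rules just described, i.e., which queue a failed Pod goes back to and how the backoff grows:

package main

import (
	"fmt"
	"time"
)

// backoffDuration follows the policy described above: start at 1s,
// double on every attempt, and cap at 10s.
func backoffDuration(attempts int) time.Duration {
	d := 1 * time.Second
	for i := 1; i < attempts; i++ {
		d *= 2
		if d > 10*time.Second {
			return 10 * time.Second
		}
	}
	return d
}

// requeue decides where a Pod that failed scheduling goes: backoffQ if the
// scheduler cache changed while the Pod was being scheduled, otherwise
// unschedulableQ.
func requeue(cacheChangedDuringScheduling bool) string {
	if cacheChangedDuringScheduling {
		return "backoffQ"
	}
	return "unschedulableQ"
}

func main() {
	for attempt := 1; attempt <= 5; attempt++ {
		fmt.Println(attempt, backoffDuration(attempt)) // 1s 2s 4s 8s 10s
	}
	fmt.Println(requeue(true), requeue(false))
}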

Scheduling algorithm

Predicates (filter)

Filters can be divided into four categories by purpose:

  • Storage-related matching
  • Pod-and-Node matching
  • Pod-and-Pod matching
  • Pod spreading

Storage-related

There are several storage-related filters:

  • NoVolumeZoneConflict: when the PV associated with a Pod's PVC carries zone/az labels, this restricts the matching Nodes to those satisfying the PV's zone constraints;
  • MaxCSIVolumeCountPred: checks the configured limit on the maximum number of PVs that a single CSI plugin may provision;
  • CheckVolumeBindingPred: performs the checks used during PV/PVC binding; its internal logic is fairly complex and mostly deals with how to reuse existing PVs;
  • NoDiskConflict: ensures SCSI storage volumes are not mounted by more than one Pod.

Pod-and-Node matching

  • CheckNodeCondition: checks whether the node is ready to be scheduled, i.e., that in node.condition the condition type Ready is true and NetworkUnavailable is false, and that Node.Spec.Unschedulable is false;
  • CheckNodeUnschedulable: a node can be marked NodeUnschedulable directly; once marked, the node will not be scheduled onto. In version 1.16 this Unschedulable flag has become a Taint, so the check becomes whether the Tolerations declared on the Pod can tolerate that Taint;
  • PodToleratesNodeTaints: checks whether the Pod's Tolerations cover the Node's Taints;
  • PodFitsHostPorts: checks whether the Ports declared by the Pod's Containers are already in use by Pods assigned to the Node;
  • MatchNodeSelector: checks whether Pod.Spec.Affinity.NodeAffinity and Pod.Spec.NodeSelector match the Node's Labels.

Pod-and-Pod matching

MatchInterPodAffinity: mainly the checking logic for PodAffinity and PodAntiAffinity. The most complex part is that the PodAffinityTerm described in the Affinity supports a TopologyKey (which can represent a topology such as node/zone/az); this is, in practice, a performance killer.

Pod spreading

  • EvenPodsSpread
  • CheckServiceAffinity

EvenPodsSpread

This is a new feature. First let's look at how EvenPodsSpread is described in the Spec:
- it describes the requirement for a qualifying group of Pods to be spread across a specified TopologyKey.

Here is how such a group of Pods is described, as shown below:

spec:
  topologySpreadConstraints:
  - maxSkew: 1
    whenUnsatisfiable: DoNotSchedule
    topologyKey: kubernetes.io/hostname
    labelSelector:
      matchLabels:
        app: foo
      matchExpressions:
      - key: app
        operator: In
        values: ['foo', 'foo2']

topologySpreadConstraints: describes the topology across which the Pods should be evenly spread; when multiple topologySpreadConstraints are given, they must all be satisfied;
labelSelector: selects the group of Pods whose spreading is being described;
topologyKey: the node label key that defines the topology domains to spread across;
maxSkew: the maximum allowed imbalance;
whenUnsatisfiable: the policy used when the topologySpreadConstraint cannot be satisfied. DoNotSchedule: takes effect in the Filter phase; ScheduleAnyway: takes effect in the Score phase.

Take the following example:

(Figure: spreading a group of Pods across three zones)
The labelSelector selects all Pods with the label app=foo, which must be spread at the zone level, with a maximum allowed imbalance of 1.
The cluster has three zones; in the figure above, one Pod with label app=foo has already been placed in zone1 and one in zone2.
The imbalance is computed as: ActualSkew = count[topo] - min(count[topo])
First, the labelSelector is used to obtain the list of qualifying Pods;
then they are grouped by topologyKey to obtain count[topo].

Continuing with the figure:

Assume maxSkew is 1. If the new Pod were assigned to zone1 or zone2, the skew would be 2, which is greater than the configured maxSkew, so those placements do not match and the Pod can only be assigned to zone3. If it is assigned to zone3, min(count[topo]) is 1 and count[topo] is 1, so the skew equals 0; therefore the Pod can only go to zone3.

Now assume maxSkew is 2. If the Pod is assigned to zone1 (or zone2), the skew values become 2/1/0 (or 1/2/0), with a maximum of 2, which satisfies skew <= maxSkew, so zone1/zone2/zone3 are all allowed.

With EvenPodsSpread, a group of Pods can be spread on demand across a TopologyKey; if you need them balanced across every topology domain, set maxSkew to 1. This description still lacks some controls, however, for example a way to require the Pods to be confined to a limited set of topologyValues.
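A small self-contained sketch of the skew check described above (the counts and zone names simply reproduce the example):

package main

import "fmt"

// skewAfterPlacement computes ActualSkew = count[topo] - min(count[topo])
// for the topology domain `target`, assuming the pending Pod is placed there.
func skewAfterPlacement(counts map[string]int, target string) int {
	c := map[string]int{}
	for k, v := range counts {
		c[k] = v
	}
	c[target]++
	min := c[target]
	for _, v := range c {
		if v < min {
			min = v
		}
	}
	return c[target] - min
}

func main() {
	// One app=foo Pod already in zone1 and one in zone2, none in zone3.
	counts := map[string]int{"zone1": 1, "zone2": 1, "zone3": 0}
	maxSkew := 1
	for _, zone := range []string{"zone1", "zone2", "zone3"} {
		skew := skewAfterPlacement(counts, zone)
		fmt.Printf("%s: skew=%d allowed=%v\n", zone, skew, skew <= maxSkew)
	}
	// Output: zone1: skew=2 allowed=false, zone2: skew=2 allowed=false,
	//         zone3: skew=0 allowed=true
}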

Priorities

Now let's look at the scoring algorithms. The main problems they solve are cluster fragmentation, disaster tolerance, resource utilization (water level), affinity and anti-affinity, and so on.

They can be divided into four categories:

  • Node resource water level
  • Pod spreading (topology, service, controller)
  • Node affinity and anti-affinity
  • Pod affinity and anti-affinity

Resource water level

Let's start with the scores related to resource water level.

(Figure: resource water-level concepts)

There are four scoring algorithms related to node resource level, as shown in the figure below.

(Figure: the four resource-level scoring algorithms)
  • Concepts in the water-level formulas

Request: the resources already allocated on the Node; Allocatable: the Node's schedulable resource capacity.

  • Spread first

Assign the Pod to the node with the highest idle-resource ratio rather than the node with the most absolute free resources. Formula: idle ratio = (Allocatable - Request) / Allocatable. The larger this value, the higher the score, and nodes with higher scores are preferred. Here (Allocatable - Request) is the amount of resources left free after the Pod is assigned to the node.

  • Stack first

Assign the Pod to the node with the highest resource utilization. Formula: utilization = Request / Allocatable. The higher the utilization, the higher the score, and nodes with higher scores are preferred.

  • Fragmentation ratio

This is the difference between the utilization ratios of the various resources on a Node; CPU/Mem/Disk are currently supported. Considering only CPU and Mem, the formula is: fragmentation ratio = Abs[CPU(Request / Allocatable) - Mem(Request / Allocatable)]. For example, if the CPU allocation ratio is 99% and the memory allocation ratio is 50%, the fragmentation ratio = 99% - 50% = 49%; only 1% of CPU but 50% of Mem is left, and containers of that shape can hardly use up the remaining memory. Score = 1 - fragmentation ratio, so the higher the fragmentation, the lower the score.

  • Specified ratio

When the Scheduler starts, a score can be configured for each resource-utilization level, which makes it possible to shape the distribution curve of node resource allocation across the cluster.
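A self-contained sketch of the first three formulas (scores are normalized to 0–1 here rather than kube-scheduler's internal scale):

package main

import (
	"fmt"
	"math"
)

// leastRequested: "spread first" — higher score for nodes with a higher
// idle-resource ratio: (allocatable - request) / allocatable.
func leastRequested(request, allocatable float64) float64 {
	return (allocatable - request) / allocatable
}

// mostRequested: "stack first" — higher score for nodes with higher
// utilization: request / allocatable.
func mostRequested(request, allocatable float64) float64 {
	return request / allocatable
}

// balanced: 1 - fragmentation ratio, where the fragmentation ratio is the
// absolute difference between CPU and memory utilization.
func balanced(cpuReq, cpuAlloc, memReq, memAlloc float64) float64 {
	return 1 - math.Abs(cpuReq/cpuAlloc-memReq/memAlloc)
}

func main() {
	fmt.Println(leastRequested(2, 8))       // 0.75: 75% of the node is still free
	fmt.Println(mostRequested(2, 8))        // 0.25: 25% utilization
	fmt.Println(balanced(99, 100, 50, 100)) // 0.51: 99% CPU vs 50% Mem used
}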

Pod spreading

(Figure: Pod spreading scorers)

Pod spreading addresses the following requirement: a qualifying group of Pods needs to be spread across different topology domains.

  • SelectorSpreadPriority

Implements the requirement that all Pods under the controller that owns the Pod be spread across Nodes. It works like this: based on the controller of the Pod being scheduled, it computes all Pods under that controller, say T in total, and groups those Pods by the Node they run on; if the count on some Node is N, that Node's score is (T - N) / T. The larger the value, the fewer Pods of this controller are already on the Node and the higher the score, which achieves the workload's Pod-spreading requirement (see the sketch after this list).

  • ServiceSpreadingPriority

The upstream comments say it will most likely be used to replace SelectorSpreadPriority. Why? My personal understanding: a Service represents a group of services, and being able to spread at the service level is enough.

  • EvenPodsSpreadPriority

Used to specify the spreading requirement of a qualifying group of Pods across some topology. It is the more flexible and more customizable approach, and also the more complex one to use.

Because the way it is used may keep changing, assume the topology is as follows: the Spec requires spreading across nodes. Using the formula in the figure above, we compute, for each node, the number of Pods that satisfy the labelSelector specified in the Spec, then compute the maximum difference, and then compute each Node's weight; the larger the value, the more preferred the node.
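A minimal sketch of the (T - N) / T score mentioned for SelectorSpreadPriority above (node names and counts are illustrative):

package main

import "fmt"

// selectorSpreadScore returns (T-N)/T for each node: T is the total number
// of Pods under the controller, N is how many of them already sit on the node.
// Fewer Pods of the same controller on a node means a higher score.
func selectorSpreadScore(total int, podsPerNode map[string]int) map[string]float64 {
	scores := map[string]float64{}
	for node, n := range podsPerNode {
		scores[node] = float64(total-n) / float64(total)
	}
	return scores
}

func main() {
	// A controller with 4 Pods: 3 on node-a, 1 on node-b, 0 on node-c.
	fmt.Println(selectorSpreadScore(4, map[string]int{
		"node-a": 3, "node-b": 1, "node-c": 0,
	})) // map[node-a:0.25 node-b:0.75 node-c:1] — node-c is preferred
}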

Node affinity & anti-affinity

(Figure: Node affinity and anti-affinity scorers)
  • NodeAffinityPriority: satisfies affinity & anti-affinity between the Pod and Nodes;
  • ServiceAntiAffinity: supports balancing the Pods of a Service across the values of some Node label. For example, if a cluster has one group of nodes on the cloud and another off the cloud, and we want a service spread evenly between the two, then given such a label on the Nodes we can use ServiceAntiAffinity to spread its Pods accordingly;
  • NodeLabelPrioritizer: mainly used to preferentially assign Pods to Nodes carrying certain specific labels. The algorithm is simple: based on the label values configured in the scheduling policy (SchedulerPolicy) at startup, it checks whether a Node satisfies the label condition; Nodes that do are preferred;
  • ImageLocalityPriority: node affinity that mainly considers image download speed. If a node already has the image, the Pod is preferentially scheduled onto that node; image size is also taken into account, since a Pod may use several images and larger images are slower to download, so the affinity is weighted by the size of the images already present on the node.

Pod affinity & anti-affinity

InterPodAffinityPriority

First, some usage scenarios:

  • Example one: application A provides data and application B provides a service on top of it; deploying A and B together lets them communicate over the local network and optimizes network transfer;
  • Example two: if applications A and B are both CPU-intensive and have been shown to interfere with each other, this rule can be used to keep them off the same node as much as possible.

NodePreferAvoidPodsPriority

Implements the ability to keep certain controllers away from certain nodes as much as possible. An annotation on a node declares which controllers should not be assigned to that Node; Pods that do not match the annotation score higher for it.

How to configure the scheduler

Introduction to scheduler configuration

(Figure: starting the scheduler)
How do you start a scheduler? There are two cases:

  • First, start the scheduler with the default configuration, without specifying any parameters;
  • Second, start it with a specified configuration file.

If you start the scheduler the default way and want to know which parameters the default configuration uses, you can pass --write-config-to to write the default configuration to a specified file. Let's take a look at that default configuration file, as shown below:

(Figure: the default scheduler configuration file)
  • algorithmSource: the algorithm source; three options are currently provided: Provider, file, and configMap, covered in the next section;
  • percentageOfNodesToScore: an extension the scheduler provides to reduce the sampling size of Nodes;
  • schedulerName: indicates which Pods this scheduler instance is responsible for when it starts; if not specified, the default name is default-scheduler;
  • bindTimeoutSeconds: the time allowed for the bind phase, in seconds;
  • clientConnection: parameters for interacting with kube-apiserver, for example contentType, the serialization protocol used with kube-apiserver, here set to protobuf;
  • disablePreemption: disables preemptive scheduling;
  • hardPodAffinitySymmetricWeight: configures the relative weight of PodAffinity and NodeAffinity.

algorithmSource

(Figure: algorithmSource options)

This section describes the configuration file formats for the filters, scorers, and so on. Three sources are currently provided:

  • Provider
  • file
  • configMap

If Provider is specified, there are two implementations:

  • DefaultProvider;
  • ClusterAutoscalerProvider.

ClusterAutoscalerProvider prefers stacking, while DefaultProvider prefers spreading. If your nodes have autoscaling enabled, ClusterAutoscalerProvider is usually the better match for your needs.

Let's look at the contents of a policy file, as shown below:

(Figure: an example scheduling policy file)

Here you can see the configured filters (predicates), the configured scorers (priorities), and the extender schedulers we configured. One interesting parameter is alwaysCheckAllPredicates: it controls whether, when one filter in the list returns false, the remaining filters are still executed. The default is of course false; if it is set to true, every plugin in the list is evaluated.

How to extend the scheduler

Scheduler Extender

(Figure: Scheduler Extender)

First, what can a Scheduler Extender do? After starting the official scheduler, you can additionally start an extender scheduler service.

It is configured through the configuration file: in the Policy file mentioned above, the extender configuration includes the URL of the extender service, whether it is an HTTPS service, and whether the service has a NodeCache. If it has a NodeCache, the scheduler passes only the list of node names; if not, the scheduler passes the complete nodeinfo structures.

The ignorable parameter indicates whether the scheduler may ignore the extender when it is unreachable over the network or returns an error. managedResources means the official scheduler will call the extender only when it encounters one of these Resources; if it is not specified, everything goes through the extender.

Here is a GPU-share example. The extender keeps track of how much memory is allocated on each GPU card, while the official scheduler only checks whether the Node's total GPU memory is sufficient. The extended resource here is called example/gpu-mem: 200g. Suppose a Pod requesting it needs to be scheduled: kube-scheduler sees our extended resource, which is configured to go through the extender, so during scheduling it calls the extender at the configured URL; in this way the scheduler gains the ability to implement GPU sharing.
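A minimal sketch of what such an extender's filter endpoint could look like. The request/response structs below only loosely mirror the scheduler's extender protocol, and the per-node numbers, the port, and the 4 GiB request are made up for this example; treat it as an illustration rather than the exact API.

package main

import (
	"encoding/json"
	"net/http"
)

// Simplified request/response shapes for the extender filter verb; the real
// protocol types in the kube-scheduler extender API carry more fields.
type extenderArgs struct {
	Pod       map[string]interface{} `json:"pod"`
	NodeNames []string               `json:"nodenames"`
}

type extenderFilterResult struct {
	NodeNames   []string          `json:"nodenames"`
	FailedNodes map[string]string `json:"failedNodes"`
}

// freeGPUMem is this extender's own book-keeping of per-card free memory (GiB),
// which the default scheduler knows nothing about.
var freeGPUMem = map[string]int{"node-a": 8, "node-b": 2}

func filter(w http.ResponseWriter, r *http.Request) {
	var args extenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	need := 4 // GiB requested via example/gpu-mem; parsing of the Pod spec is omitted
	result := extenderFilterResult{FailedNodes: map[string]string{}}
	for _, n := range args.NodeNames {
		if freeGPUMem[n] >= need {
			result.NodeNames = append(result.NodeNames, n)
		} else {
			result.FailedNodes[n] = "not enough free gpu memory on a single card"
		}
	}
	json.NewEncoder(w).Encode(result)
}

func main() {
	http.HandleFunc("/filter", filter) // the URL configured for the extender's filter verb
	http.ListenAndServe(":8888", nil)
}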

Scheduler Framework

(Figure: Scheduler Framework extension points and concurrency model)

This is discussed in two parts: the purpose of the extension points, and the concurrency model.

Main uses of the extension points

The main uses of the extension points are the following:

  • QueueSort: supports custom Pod ordering; if a QueueSort ordering algorithm is specified, the scheduling queue is sorted with it;
  • PreFilter: pre-processes the Pod's request, for example a cache for the Pod can be set up at this stage;
  • Filter: extends the Filter step; you can add the filters you need, for example the gpu-share case mentioned earlier could be implemented here;
  • PostFilter: can be used for logs/metrics, or for pre-processing data before Score; for example a custom cache plugin could do its work here;
  • Score: the scoring plugin, an interface for enhancing scoring;
  • Reserve: stateful plugins can do in-memory resource accounting here;
  • Permit: wait, deny, approve; this can serve as the insertion point for gang scheduling. It can make each Pod wait until all Pods of the group have been scheduled and become available before letting them pass; if one Pod fails, the others can be denied here;
  • PreBind: performs some operations before the actual bind to the node, for example attaching a cloud disk to the Node;
  • Bind: a Pod is handled by only one BindPlugin;
  • PostBind: logic that runs after a successful bind, for example for logs/metrics;
  • Unreserve: rolls back whenever any stage between Permit and Bind fails; for example if Permit or PreBind fails, the reserved resources are rolled back here.

Concurrency model

The concurrency model is this: the main scheduling flow runs from PreFilter to Reserve, shown in light blue in the figure above. Once a Pod taken from the Queue has been scheduled through Reserve, the main flow is done for that Pod; it is then handed asynchronously to the Wait Thread, and if the wait succeeds, on to the Bind Thread. That is the threading model.

Custom Plugin

How do you write and register a custom Plugin?

(Figure: official example of a custom Bind plugin)

Here is an official example. In the Bind phase, the Pod has to be bound to a Node, which is done by performing the Bind against kube-apiserver. There are two main methods: one declares the plugin's name, and the other implements the bind logic. Finally, a constructor must also be implemented to tell the framework how to construct the plugin.
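The following is a rough, self-contained sketch of the shape of such a plugin; the interface below is a simplified stand-in for the framework's BindPlugin interface (whose real signature uses the framework's own Pod and status types and varies between releases), so treat it as an illustration rather than the official example from the figure.

package main

import (
	"context"
	"fmt"
)

// Simplified stand-ins for the framework types; a real plugin implements the
// scheduler framework's BindPlugin interface against *v1.Pod and a status type.
type Pod struct{ Name string }
type Status struct{ Err error }

type BindPlugin interface {
	Name() string
	Bind(ctx context.Context, pod *Pod, nodeName string) *Status
}

// DefaultBinder is a sketch of a bind plugin: in a real plugin the Bind
// method would post a Binding object to kube-apiserver via the clientset
// handed in by the constructor.
type DefaultBinder struct{}

func (b DefaultBinder) Name() string { return "DefaultBinder" }

func (b DefaultBinder) Bind(ctx context.Context, pod *Pod, nodeName string) *Status {
	fmt.Printf("binding pod %s to node %s\n", pod.Name, nodeName)
	return nil // nil status means success in this sketch
}

// New is the constructor the framework calls when the plugin is registered;
// the real signature also receives the plugin args and a framework handle.
func New() (BindPlugin, error) { return DefaultBinder{}, nil }

func main() {
	p, _ := New()
	p.Bind(context.Background(), &Pod{Name: "pod-a"}, "node-1")
}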

Starting a scheduler with the custom Plugin:

  • vendor
  • fork

(Figure: registering the custom plugin at startup)

At startup the plugin can be registered in two ways:

  • The first way is to write your own main program and pull the scheduler code in via vendor; when calling scheduler.NewSchedulerCommand at startup, register the defaultbinder, and you can then start a scheduler with the plugin (see the sketch after this list);
  • The second way is to fork the kube-scheduler source code and register the defaultbinder through the plugin registry. After registering the plugin, build a script and an image, and at startup enable the plugin under plugins.bind.enabled in the configuration file.
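A sketch of the first approach, following the out-of-tree plugin pattern. The import path and the app.WithPlugin option shown here are assumptions to verify against the kube-scheduler release you actually vendor, which is why the plugin registration line is left commented out:

// cmd/my-scheduler/main.go — a sketch only; import paths and the plugin
// factory signature differ between kube-scheduler releases, so treat the
// names below as assumptions to verify against the vendored version.
package main

import (
	"os"

	"k8s.io/kubernetes/cmd/kube-scheduler/app"
	// "example.com/myscheduler/pkg/defaultbinder" // your plugin package (hypothetical path)
)

func main() {
	// app.WithPlugin registers an out-of-tree plugin factory under a name that
	// can then be enabled via plugins.bind.enabled in the scheduler config.
	command := app.NewSchedulerCommand(
		// app.WithPlugin("DefaultBinder", defaultbinder.New),
	)
	if err := command.Execute(); err != nil {
		os.Exit(1)
	}
}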

Summary

That is all for this article; here is a brief summary:

  • The first part introduced the scheduler's overall workflow and some optimizations of its algorithms;
  • The second part described in detail the implementation of the main scheduling components, the filter components and the score components, and listed several usage scenarios for the scorers;
  • The third part explained how to use the scheduler's configuration files, so that you can obtain the scheduling behavior you expect through configuration;
  • The fourth part covered advanced usage: how to extend the scheduling capabilities via an extender or the framework to meet the scheduling needs of special business scenarios.
