Chapter 18: Kubernetes scheduling and resource management

This lesson mainly covers three topics:

  1. the Kubernetes scheduling process;
  2. Kubernetes basic scheduling capabilities (resource scheduling, relationship scheduling);
  3. Kubernetes advanced scheduling capabilities (priority, preemption).

In addition, the scheduler architecture and algorithms will be introduced by my colleague in the next lesson.

Kubernetes scheduling process

Let's start with the first part: the Kubernetes scheduling process. Picture a simple Kubernetes cluster architecture: it includes a kube-ApiServer, a set of webhook Controllers, and the default scheduler kube-Scheduler, plus two physical nodes, Node1 and Node2, each with a kubelet deployed on it.

Now, if we submit a pod to this Kubernetes cluster, what does its scheduling process look like?

Suppose we have written a yaml file describing a pod, call it pod1, and we submit this yaml file to kube-ApiServer.

At this point, ApiServer routes the creation request to the webhook Controllers for validation.

After the request passes validation, ApiServer creates a pod in the cluster. At this point, however, the pod's nodeName is empty and its phase is Pending. Once the pod is created, kube-Scheduler and the kubelets can both watch the pod creation event; when kube-Scheduler sees that the pod's nodeName is empty, it considers the pod unscheduled.

Next, the scheduler takes the pod into its own scheduling queue. Through a series of scheduling algorithms, including filtering and scoring, the scheduler selects the most suitable node and binds that node's name to the pod's spec, completing one scheduling cycle.

At this point the pod's spec.nodeName has been updated to Node1. Once nodeName is updated, the kubelet on Node1 watches the pod and sees that it belongs to its own node.

The kubelet then runs the pod on the node: it creates the containers along with storage, network, and other resources. Once everything is ready, the kubelet updates the pod's status to Running, and the whole scheduling process is complete.

To summarize the scheduling process we just walked through in one sentence: it does exactly one thing, namely placing the pod onto an appropriate node.

The keyword here is "appropriate". What counts as appropriate? It can be defined by several characteristics:

  1. first, the node must satisfy the pod's resource requirements;
  2. second, the node must satisfy the pod's special relationship requirements;
  3. third, the node's own restrictions on which pods it accepts must be satisfied;
  4. finally, the resources of the entire cluster should be used rationally.

Only when these requirements are met can we consider a pod to have been placed on an appropriate node.

Next, I will introduce how Kubernetes matches pods to nodes that meet these requirements.

Kubernetes basic scheduling capabilities

Here I will introduce the basic scheduling capabilities of Kubernetes in two parts:

  1. The first part is resource scheduling, covering the basic Resources configuration in Kubernetes, the concept of QoS, and the concept and use of ResourceQuota;
  2. The second part is relationship scheduling, which covers two scenarios:
     • the relationship between pods: how to make a pod affine to another pod, and how to make pods mutually exclusive;
     • the relationship between pods and nodes: how to make a pod affine to a node, and how a node can restrict certain pods from being scheduled onto it.

How to meet a Pod's resource requirements

Pod resource configuration

apiVersion: v1
kind: Pod
metadata:
  namespace: demo-ns
  name: demo-pod
spec:
  containers:
  - image: nginx:latest
    name: demo-container
    resources:
      requests:
        cpu: 2
        memory: 1Gi
      limits:
        cpu: 2
        memory: 1Gi

The above is a demo pod spec. A pod's resource requirements are filled into the pod spec, under the resources key of each container.

resources consists of two parts: the first is requests; the second is limits.

The two parts have exactly the same structure, but their meanings differ: requests represents the baseline amount of resources the pod is guaranteed, while limits is the upper bound on the resources the pod may use. Concretely, requests and limits are each a map of resources, in which keys for different resources can be filled in.

The basic resources can be roughly divided into four categories:

  • the first is CPU;
  • the second is memory;
  • the third is ephemeral-storage, a temporary storage;
  • the fourth is extended resources, such as GPU.

For CPU, as in the example above where 2 CPUs are requested, the amount can also be written as 2000m, using millicores; this form is useful because the CPU demand is sometimes fractional, for example 0.2 CPU, which is written as 200m. Memory and storage use binary quantities: a request for 1 GB of memory can also be expressed as 1024Mi, which states the memory requirement more precisely.

For extended resources, Kubernetes requires that the amount be an integer, so we cannot request 0.5 of a GPU, only one or two whole GPUs. That covers how the basic resource types are requested.
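
As a reference, here is a minimal sketch of these notations inside a container's resources block; the extended resource name nvidia.com/gpu is an assumed example, and extended resources must be whole integers:

resources:
  requests:
    cpu: 200m              # 0.2 CPU, written in millicores
    memory: 1024Mi         # 1 GiB, written in binary units
    nvidia.com/gpu: 1      # extended resources must be integers
  limits:
    cpu: 200m
    memory: 1024Mi
    nvidia.com/gpu: 1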

Next, I'll explain in more detail what the difference between request and limit really is, and how the concept of QoS is derived from request/limit.

Pod QoS type

Kubernetes provides two fields for filling in a pod's resources: requests and limits. Together they give the user an elastic definition of pod capacity. For instance, we can set requests to 2 CPUs and limits to 4 CPUs, which expresses that we want a guaranteed baseline of 2 CPUs, but when the node is idle the pod may use up to 4 CPUs.

Speaking of this elastic capacity, we have to mention the concept of QoS. What is QoS? QoS stands for Quality of Service; it is how Kubernetes expresses the quality of service a pod receives with respect to resource capacity. Kubernetes provides three QoS classes:

  1. the first is Guaranteed, the highest QoS class, usually configured for pods that need guaranteed resources and guaranteed capability;
  2. the second is Burstable, the middle QoS class, usually configured for pods that want elastic capacity;
  3. the third is BestEffort, which, as the name suggests, is best-effort quality of service.

One somewhat inconvenient aspect of Kubernetes is that the user cannot directly specify which QoS class their pod belongs to; instead, the QoS class is mapped automatically from the combination of requests and limits.

In the example above, after the spec is submitted successfully, Kubernetes automatically adds a field to the status: qosClass: Guaranteed. Since the user cannot define the QoS level at submission time, this is sometimes called an implicit QoS class.

Pod QoS configuration

So how do we obtain the QoS level we want through a combination of requests and limits?

Guaranteed Pod

First, how do we create a Guaranteed pod? Kubernetes has one requirement: for a pod to be Guaranteed, its basic resources (CPU and memory) must have requests equal to limits; other resources need not be equal. Only under this condition is the created pod Guaranteed; otherwise it will be a Burstable or BestEffort pod.
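
A minimal sketch of a container's resources that yields a Guaranteed pod, with assumed values; note that a resource other than CPU/memory, such as ephemeral-storage, does not have to be equal:

resources:
  requests:
    cpu: 2
    memory: 1Gi
    ephemeral-storage: 1Gi     # non-CPU/memory resources need not match
  limits:
    cpu: 2                     # CPU request == limit
    memory: 1Gi                # memory request == limit
    ephemeral-storage: 2Gi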

Burstable Pod

Next, how do we create a Burstable pod? The range of Burstable pods is quite broad: as long as CPU/memory requests and limits are filled in but are not all equal, the pod is Burstable.

For example, a pod that fills in only CPU resources and leaves memory unset is a Burstable pod.
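
A sketch of such a resources block, with assumed values: only CPU is filled in and request differs from limit, so the pod is Burstable:

resources:
  requests:
    cpu: 1           # memory left unset
  limits:
    cpu: 2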

BestEffort Pod

The third class, the BestEffort pod, has a fixed condition: requests and limits must be left unset for all resources. Only then is the pod BestEffort.

So, by using requests and limits in different ways, we can combine them into different pod QoS classes.

How different QoS classes behave

Next: how do different QoS classes behave differently in scheduling and at the node level? In scheduling, the scheduler only uses requests; no matter how large the limits are, they are not used for scheduling decisions, only requests are.

At the node level, different QoS classes also behave quite differently. Take CPU: CPU shares (weights) are divided according to requests, and the requests of different QoS classes are completely different. For Burstable and BestEffort pods the request may be a small number or not filled in at all, so their weight is very low; a BestEffort pod may have a weight of only 2, while Burstable or Guaranteed pods may have weights of several thousand.

In addition, when the kubelet feature cpu-manager-policy=static is turned on, a Guaranteed pod whose CPU request is an integer, for example 2, will be pinned to dedicated cores. For example, such a Guaranteed pod might be assigned CPU0 and CPU1 exclusively.

Non-integer Guaranteed, Burstable, and BestEffort pods are placed into a shared CPU pool. For example, if the node has 8 cores and 2 whole cores have already been pinned to an integer Guaranteed pod, the remaining 6 cores, CPU2 through CPU7, are shared by the non-integer Guaranteed / Burstable / BestEffort pods, which then divide time slices on those 6 cores according to their weights.
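
For reference, a sketch of enabling this kubelet feature through the KubeletConfiguration file; the reservation value is an assumption, since the static policy requires some CPU to be reserved for the system:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
systemReserved:
  cpu: "500m"        # assumed reservation; the static policy needs a non-zero CPU reservation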

Memory is also divided by QoS, through the OOMScore. A Guaranteed pod is given a default OOMScore of -998; a Burstable pod is assigned an OOMScore between 2 and 999 based on the ratio of its memory request to the node's memory; a BestEffort pod is always assigned an OOMScore of 1000. The higher the OOMScore, the earlier the pod is killed when the machine runs out of memory.

Node eviction also treats QoS classes differently: when eviction happens, BestEffort pods are evicted first. So different QoS classes really do behave differently at the node level, which is why, in production, we should configure Requests and Limits according to the requirements and characteristics of each service, so that QoS classes are planned rationally.

Resource Quota

In production we often encounter this scenario: a cluster is used by multiple people or multiple services at the same time, and we want to limit the amount that any one business or person can submit, to prevent the cluster's resources from being used up entirely and leaving other businesses with nothing.

Kubernetes provides a capability for this called ResourceQuota. It can limit the amount of resources used within a namespace.

Concretely, a ResourceQuota spec contains hard and scopeSelector. The contents of hard are similar to a ResourceList, where basic resources can be filled in; but it is richer than a ResourceList in that it can also limit the number of pods. scopeSelector then provides even richer selection capability for the quota.

In this example, the quota selects non-BestEffort pods and limits cpu to 1000, memory to 200G, and pods to 10. Besides NotBestEffort, scopeSelector offers a richer set of scopes, including Terminating / NotTerminating, BestEffort / NotBestEffort, and PriorityClass. A sketch of such a quota is shown below.
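
A sketch of a ResourceQuota consistent with the values described above; the object and namespace names are assumed:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: demo-quota
  namespace: demo-ns
spec:
  hard:
    cpu: "1000"
    memory: 200Gi
    pods: "10"
  scopeSelector:
    matchExpressions:
    - scopeName: NotBestEffort
      operator: Exists        # selects only non-BestEffort pods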

Once such a ResourceQuota is applied to the cluster, if a user really exceeds the quota, the behavior is: when submitting a pod spec, they receive a 403 Forbidden error indicating that the quota is exceeded, and they can no longer submit the corresponding amount of cpu, memory, or pods.

If the resource being submitted is not covered by this ResourceQuota, the submission still succeeds. This is the basic usage of ResourceQuota in Kubernetes. We can use ResourceQuota to limit the resource usage of each namespace and thereby ensure that resources remain available for other users.

Summary: how to meet a Pod's resource requirements

That concludes the introduction to basic resource usage, i.e., how we meet a pod's resource requirements. To summarize:

  • Configure a pod's resource requirements reasonably
      • CPU / Memory / EphemeralStorage / GPU
  • Use Request and Limit to select a different QoS class for workloads with different characteristics
      • Guaranteed: sensitive workloads that need to be protected
      • Burstable: less sensitive workloads that need elasticity
      • BestEffort: workloads that can tolerate being deprioritized
  • Configure a ResourceQuota for each namespace to prevent overuse and to keep resources available for other users

How to meet Pod-to-Pod relationship requirements

Next, let's look at relationship scheduling between pods. In everyday use we may encounter scenarios such as: one pod must be placed together with another pod, or must not be placed together with another pod.

For these requirements, Kubernetes provides two capabilities:

  • the first is pod affinity scheduling: PodAffinity;
  • the second is pod anti-affinity scheduling: PodAntiAffinity.

Pod affinity scheduling

First, let's look at pod affinity scheduling. If I want one pod to be placed together with another pod, I can fill in podAffinity in the pod spec and then fill in a required rule.

In this example, the pod must be scheduled onto a node that already hosts a pod carrying the label key k1, and the match granularity is broken down by node. If a node hosting a pod with the k1 label can be found, scheduling succeeds; if no such pod exists in the cluster, or node resources are insufficient, scheduling fails. This is strict affinity scheduling, called required affinity scheduling.
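
A sketch of such a required pod affinity rule in a pod spec; the label value v1 and the node-level topologyKey are assumptions based on the description:

spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: k1
            operator: In
            values: ["v1"]                   # assumed value
        topologyKey: kubernetes.io/hostname  # break the match down per node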

Sometimes we do not need such a strict scheduling policy. In that case, required can be changed to preferred, turning it into preferred affinity scheduling, i.e., preferring to schedule onto nodes where a pod with the label key k2 is running. A preferred entry is a list that can hold multiple conditions, for example weight 100 for key k2 and weight 10 for key k1; when scheduling, the scheduler then prefers the node with the higher total weight score for this pod.

Pod anti-affinity scheduling

Having covered affinity scheduling, anti-affinity scheduling is very similar: the function is inverted but the syntax is basically the same, with podAffinity replaced by podAntiAffinity. required achieves mandatory anti-affinity, and preferred achieves preferred anti-affinity.

Here are two examples: one forbids scheduling onto nodes hosting pods labeled with key k1; the other prefers not to schedule onto nodes hosting pods labeled with key k2.

Besides the In operator, Kubernetes also provides richer operator combinations: In / NotIn / Exists / DoesNotExist. The examples above use In; for instance, the first mandatory anti-affinity example forbids scheduling onto nodes hosting pods labeled with key k1.

The same effect can also be achieved with Exists, whose match range can be broader than In: when the operator is Exists, values does not need to be filled in. The effect is to forbid scheduling onto any node hosting a pod whose labels contain the key k1, no matter what the value is; as long as a pod on the node carries the label key k1, this pod cannot be scheduled there.
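
A sketch of the Exists form of required anti-affinity described above, with the node-level topologyKey assumed:

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: k1
            operator: Exists                 # any value of k1 matches; no values needed
        topologyKey: kubernetes.io/hostname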

That covers relationship scheduling between pods.

How to meet Pod-to-Node scheduling relationships

The scheduling relationship between pods and nodes is also called node affinity scheduling. The main kinds of usage are introduced below.

NodeSelector

The first is NodeSelector, a relatively simple way to play this. For example, suppose a pod must be scheduled onto nodes carrying the label k1: v1. Then we can fill in a nodeSelector requirement in the pod's spec. nodeSelector is a map structure in which the required node labels can be written directly, such as k1: v1, and the pod will then be forcibly scheduled onto nodes carrying the k1: v1 label.
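
A minimal sketch of this fragment in a pod spec:

spec:
  nodeSelector:
    k1: v1        # only nodes labeled k1=v1 are eligible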

NodeAffinity

NodeSelector is a very simple mechanism, but it has a problem: it only expresses mandatory scheduling; if I want preferred scheduling, nodeSelector cannot do it. So the Kubernetes community added a mechanism called NodeAffinity.

It is somewhat similar to PodAffinity and also provides two scheduling policies:

  • the first is required: the pod must be scheduled onto a certain class of nodes;
  • the second is preferred: the pod is preferentially scheduled onto a certain class of nodes.

Its basic syntax is similar to PodAffinity and PodAntiAffinity above. For the operator, NodeAffinity provides richer options than PodAffinity: Gt and Lt are added for comparing values. When using Gt, values can only contain numbers.
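
A sketch of NodeAffinity combining a required In rule with a preferred Gt rule; the label names, values, and weight are assumptions for illustration:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: k1
            operator: In
            values: ["v1"]
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: cpu-count          # assumed numeric node label
            operator: Gt
            values: ["8"]           # Gt values must be numbers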

Node taints / tolerations

There is a third category of scheduling: marking a node so that certain pods are restricted from being scheduled onto it. Kubernetes calls these markers Taints, which literally means to contaminate.

How do we restrict pods from being scheduled onto certain nodes? For example, suppose there is a node called demo-node that has problems, and I want to limit which pods can be scheduled onto it. I can then add a taint to this node; a taint contains a key, a value, and an effect:

  • key is the configured key;
  • value is its content;
  • effect describes what behavior this taint triggers.

Kubernetes currently has three taint effects:

  1. NoSchedule prevents new pods from being scheduled onto the node;
  2. PreferNoSchedule means try not to schedule onto the node;
  3. NoExecute evicts pods that have no corresponding toleration and does not schedule new ones onto the node. This policy is very strict, so use it with care.

In our example, the taint k1=v1 with effect NoSchedule is added to demo-node. Its effect is that new pods which do not specifically tolerate this taint cannot be scheduled onto this node.
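
A sketch of this taint as it appears on the Node object; the same taint can also be added with a command such as kubectl taint nodes demo-node k1=v1:NoSchedule:

spec:
  taints:
  - key: k1
    value: v1
    effect: NoSchedule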

But what if some pods do need to be scheduled onto this node? Then a toleration can be added to those pods: fill in tolerations in the pod's spec. It also contains key, value, and effect, and these three values must correspond exactly to the taint: whatever key, value, and effect the taint carries, the toleration must contain the same content.

tolerations has one more option, operator, which takes two values: Exists / Equal. Equal means the value must be filled in, whereas Exists, just like in the affinity rules above, means value does not need to be filled in: as long as the key matches, the toleration is considered to match the taint.

With such a toleration on a pod, only pods carrying this toleration can be scheduled onto the node carrying the taint. The benefit is that the node can selectively accept some pods rather than all pods, achieving the effect of restricting certain pods to certain nodes. A sketch of the toleration is shown below.
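
A sketch of the matching toleration in the pod spec, using the Equal operator:

spec:
  tolerations:
  - key: k1
    operator: Equal
    value: v1
    effect: NoSchedule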

Summary

We have now covered the special relationship and restriction conditions for Pod/Node scheduling. To summarize what to do:

First, if there are requirements on the relationship between pods, such as affinity or mutual exclusion between pods, configure the following parameters:

  • PodAffinity
  • PodAntiAffinity

If there is affinity between a pod and nodes, configure the following parameters:

  • NodeSelector
  • NodeAffinity

If a node needs to restrict certain pods from being scheduled onto it, for example a faulty node or a node reserved for special services, configure the following parameters:

  • Node -- Taints
  • Pod -- Tolerations

Kubernetes advanced scheduling capabilities

Having covered the basic scheduling capabilities, let's look at the advanced scheduling capabilities.

Priority Scheduling

The main concepts in priority scheduling and preemption are:

  • Priority
  • Preemption

Recall the fourth characteristic mentioned in the scheduling process: making rational use of the whole cluster. When cluster resources are sufficient, the basic scheduling capabilities alone are enough to use them reasonably. But when resources are not sufficient, how do we still use the cluster rationally? There are usually two strategies:

  • first come, first served (FIFO): simple, relatively fair, and fast;
  • priority (Priority): matches the characteristics of day-to-day business.

In actual production, first come, first served is an unfair policy, because a company's businesses inevitably include both high-priority and low-priority traffic; a priority policy therefore fits everyday business characteristics better than first come, first served.

Let me explain what the priority policy means with an example. Suppose a node with only 2 CPUs is already occupied by a pod. When another, higher-priority pod arrives, the low-priority pod should hand those 2 CPUs over to the high-priority pod: the low-priority pod goes back to the waiting queue, or the business resubmits it later. This is the process of priority preemptive scheduling.

In Kubernetes, PodPriority and Preemption, the priority and preemption features, became stable in the v1.14 release and are enabled by default.

Priority Scheduling Configuration

How do we use priority scheduling? Create a PriorityClass, then configure a priorityClassName for each pod; this completes the configuration of priorities and priority scheduling.

The demo defines two PriorityClasses:

  • a PriorityClass named high is created with a high priority value of 10000;
  • a PriorityClass named low is also created, with a value of 100.

Then priorityClassName: high is configured on Pod1 and priorityClassName: low on Pod2; the field is placed in the pod spec as priorityClassName: high. With the PriorityClasses and pods configured this way, priority scheduling is enabled for the cluster. A sketch of this configuration is shown below.
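
A sketch of the PriorityClass objects and a pod using one of them, consistent with the values described; pod and container names are assumed:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high
value: 10000          # higher value means higher priority
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low
value: 100
---
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  priorityClassName: high     # pod1 gets the high priority
  containers:
  - name: main
    image: nginx:latest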

Built-in priority configuration

Kubernetes also has built-in default priorities. One is DefaultPriorityWhenNoDefaultClassExists: if no default PriorityClass is configured in the cluster, the priority of all pods is set to this value, which is 0.

Another built-in value is the maximum priority a user can define: HighestUserDefinablePriority = 1000000000 (1 billion)

System-level priority: SystemCriticalPriority = 2000000000 (2 billion)

Built-in system-level priority classes:

  • system-cluster-critical
  • system-node-critical

These are the basic configuration of priority scheduling and the built-in priorities.

Priority scheduling process

With the above configuration in place, what does the whole priority scheduling process look like? Here is a simple walkthrough.

First, a process that only triggers priority scheduling, without triggering preemption.

Suppose there are Pod1 and Pod2: Pod1 is configured with high priority and Pod2 with low priority. Both are submitted to the scheduling queue.

When processing the queue, the scheduler selects the higher-priority Pod1 first; after the scheduling process, Pod1 is bound to Node1.

Then the lower-priority Pod2 goes through the same process and is also bound to Node1.

This completes a simple priority scheduling process.

Priority preemption process

What happens if there are not enough resources when a high-priority pod is scheduled?

Start from the same scenario as before, but this time Pod0 has been placed on Node1 in advance, taking up part of its resources. Pod1 and Pod2 are both waiting to be scheduled, and Pod1's priority is higher than Pod2's.

Suppose Pod2 is scheduled first; after the scheduling process it is bound to Node1.

Next, Pod1 is scheduled, but since Node1 already has two pods on it and resources are insufficient, scheduling fails.

When scheduling fails, Pod1 enters preemption. Preemption screens the nodes of the whole cluster and finally picks Pod2 as the pod to be preempted; the scheduler then removes Pod2 from Node1.

Then Pod1 is scheduled onto Node1. This completes a preemptive scheduling process.

Priority preemption strategy

Next, let's look at the concrete strategy and flow of preemption.

The whole priority preemption flow is the workflow of kube-scheduler. First, when a pod enters preemption, the scheduler determines whether the pod is eligible to preempt, since it may already have preempted once before. If it is eligible, the scheduler filters all nodes in the cluster, keeping the nodes that meet the preemption requirements and filtering out the rest.

Next, from the remaining nodes, the scheduler selects nodes suitable for preemption. The preemption process simulates a scheduling pass: it first removes the lower-priority pods on a node and then checks whether the preempting pod could be placed on that node. A set of candidate nodes is selected this way, and then the next step, called ProcessPreemptionWithExtenders, runs. This is an extension hook where users can add their own node-preemption strategies; if no extension hook is configured, nothing happens here.

The next step, PickOneNodeForPreemption, selects the most suitable node from the candidate list produced by selectNodeForPreemption, following a specific strategy. Briefly, the strategy is:

  • prefer the node that breaks the fewest PDBs;
  • next, pick the node where the highest priority among the pods to be preempted is the lowest;
  • next, pick the node where the sum of the priorities of the pods to be preempted is the smallest;
  • next, pick the node with the fewest pods to be preempted;
  • finally, pick the node whose to-be-preempted pod started most recently.

After filtering through these five strategies in sequence, the most suitable node is selected, and the pods to be preempted on that node are deleted, completing one preemption flow.

Summary

That was a brief introduction to advanced scheduling policies, which let us schedule resources reasonably even when cluster resources are tight. The things to do are:

  • create custom priority classes (PriorityClass);
  • configure different priorities (priorityClassName) for different types of pods;
  • combine different types of pods with priority preemptive scheduling to make cluster resource usage elastic.
