TKE User Story | Jobbang Kubernetes Native Scheduler Optimization Practice

Author

Lu Yalin joined Jobbang in 2019 and is responsible for architecture R&D. At Jobbang he has led the evolution of the cloud-native architecture and driven the adoption of containerization, service governance, a Go microservice framework, and DevOps.

Introduction

The essence of a scheduling system is to match computing services/tasks with appropriate resources so that they run stably and efficiently, and, on that basis, to further increase the density of resource usage. Many factors affect how an application runs: CPU, memory, IO, and differentiated resource devices all influence performance. At the same time, individual and overall resource requests, hardware/software/policy constraints, affinity requirements, data locality, interference between workloads, and the interweaving of different application scenarios, such as periodic traffic, compute-intensive workloads, and online-offline colocation, all change the scheduling decisions.

The scheduler's goal is to do this matching quickly and accurately, but speed and accuracy often conflict when resources are limited, so a trade-off between the two is required.

Scheduler Principle and Design

The overall working framework of the K8s default scheduler can be simply summarized in the following figure:

Two control loops

  1. The first control loop is called the Informer Path. Its main job is to start a series of Informers to watch (Watch) for changes to scheduling-related API objects such as Pods, Nodes, and Services in the cluster. For example, after a to-be-scheduled Pod is created, the handler of the Pod Informer adds it to the scheduling queue. At the same time, the scheduler is responsible for updating the Scheduler Cache and uses this cache as reference information to improve the performance of the whole scheduling process.

  2. The second control loop, the main loop that schedules Pods, is called the Scheduling Path. It continuously takes pending Pods from the scheduling queue and runs a two-step algorithm to select the optimal node.

  • Among all nodes in the cluster, select all nodes that "can" run the Pod; this step is called Predicates;

  • Score the nodes selected in the previous step according to a series of optimization algorithms and select the "best" node with the highest score; this step is called Priorities.

After scheduling completes, the scheduler assigns the chosen node to the Pod's spec.nodeName; this is called Bind. To avoid accessing the API Server on the main scheduling path and hurting performance, the scheduler only updates the relevant Pod and Node information in the Scheduler Cache; this optimistic way of updating API objects is called Assume in K8s. Afterwards, a goroutine is created to asynchronously issue the Bind update to the API Server. Even if this step fails it does not matter much; everything returns to normal once the Scheduler Cache is updated.
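To make the two-step Scheduling Path concrete, below is a minimal, self-contained Go sketch of the Predicates/Priorities selection. The Node/Pod types and the "most CPU left over" scoring rule are simplified placeholders for illustration, not the real kube-scheduler code.

```go
package main

import "fmt"

type Node struct {
	Name           string
	AllocatableCPU int64 // millicores
	AllocatableMem int64 // bytes
}

type Pod struct {
	Name       string
	RequestCPU int64 // millicores
	RequestMem int64 // bytes
}

// Predicates: can this node run the pod at all?
func feasible(p Pod, n Node) bool {
	return n.AllocatableCPU >= p.RequestCPU && n.AllocatableMem >= p.RequestMem
}

// Priorities: score feasible nodes; here we simply prefer the node with the
// most CPU left over after placing the pod ("least allocated" style).
func score(p Pod, n Node) int64 {
	return n.AllocatableCPU - p.RequestCPU
}

// schedule runs the two steps and returns the chosen node name, if any.
func schedule(p Pod, nodes []Node) (string, bool) {
	bestName, bestScore, found := "", int64(-1), false
	for _, n := range nodes {
		if !feasible(p, n) { // step 1: Predicates
			continue
		}
		if s := score(p, n); s > bestScore { // step 2: Priorities
			bestName, bestScore, found = n.Name, s, true
		}
	}
	return bestName, found
}

func main() {
	nodes := []Node{
		{Name: "node-a", AllocatableCPU: 2000, AllocatableMem: 4 << 30},
		{Name: "node-b", AllocatableCPU: 8000, AllocatableMem: 16 << 30},
	}
	pod := Pod{Name: "web-1", RequestCPU: 1000, RequestMem: 2 << 30}
	if name, ok := schedule(pod, nodes); ok {
		// In the real scheduler the result is first Assumed into the cache and
		// then bound to spec.nodeName asynchronously.
		fmt.Println("bind", pod.Name, "->", name)
	}
}
```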

Problems and challenges brought by large-scale cluster scheduling

The default scheduling strategy of K8s performs well in small clusters, but as the scale of the business and the diversity of workload types grow, its limitations gradually show: few scheduling dimensions, no concurrency, performance bottlenecks, and an increasingly complex scheduler.

So far, a single cluster of ours has thousands of nodes and more than 100,000 Pods, with an overall resource allocation rate above 60%, and covers resources such as GPU and complex scenarios such as online-offline colocation. Along the way we ran into many scheduling problems.

Problem 1: Uneven node load during peak hours

The default scheduler refers only to the workload's request value. If we set the request too high, resources are wasted; if we set it too low, CPU imbalance can become severe at peak times. Affinity policies can avoid this to some extent, but they require a large number of policies to be written and frequently adjusted, and the maintenance cost is very high. Moreover, a service's request often does not reflect its real load, and this gap shows up as uneven node load during peak hours.

A real-time scheduler uses each node's real-time data for node scoring at scheduling time, but in practice real-time scheduling is not applicable in many scenarios, especially for services with an obvious periodic pattern. For example, most of our services see evening-peak traffic that is dozens of times the usual traffic, so resource usage differs hugely between peak and off-peak hours. Releases, however, are generally done at off-peak times, so with a real-time scheduler the nodes look balanced at release time yet differ enormously at the evening peak. Many real-time schedulers then fall back on a rebalancing strategy and reschedule when the skew becomes large, but migrating service Pods during peak hours while keeping the services highly available is unrealistic. Clearly, real-time scheduling falls far short of our business scenarios.

Our solution: scheduling based on peak-time prediction

Therefore, this situation calls for predictive scheduling. Based on historical peak-time usage of CPU, IO, network, logs, and other resources, we run a regression over the optimal arrangement of services on nodes to obtain weight coefficients for each service and resource, and extend the scheduler with resource-based weighted scoring. In other words, we use past peak data to predict how the services on a node will use resources at the next peak, and use that prediction to influence the node scoring results during scheduling.
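As a rough illustration of this idea, the following Go sketch scores a node by its predicted peak CPU utilization after placing a service. The predictedPeak map, the capacity numbers, and the scoring formula are assumptions made up for the example, not our actual regression output or weight coefficients.

```go
package main

import "fmt"

type NodeForecast struct {
	Name             string
	CapacityCPU      float64 // cores
	PredictedPeakCPU float64 // sum of regressed peak usage of pods already on the node
}

// predictedPeak: per-service peak CPU derived from historical peak-hour data
// (illustrative values).
var predictedPeak = map[string]float64{
	"api-gateway": 3.5,
	"search":      1.2,
}

// scoreNode prefers the node whose predicted peak utilization stays lowest
// after adding the service, so evening-peak load spreads out instead of
// piling onto nodes that merely look idle at release time.
func scoreNode(service string, n NodeForecast) float64 {
	util := (n.PredictedPeakCPU + predictedPeak[service]) / n.CapacityCPU
	return 100 * (1 - util) // higher is better; clamping omitted for brevity
}

func main() {
	nodes := []NodeForecast{
		{Name: "node-a", CapacityCPU: 32, PredictedPeakCPU: 20},
		{Name: "node-b", CapacityCPU: 32, PredictedPeakCPU: 8},
	}
	for _, n := range nodes {
		fmt.Printf("%s score=%.1f\n", n.Name, scoreNode("api-gateway", n))
	}
}
```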

Problem 2: Diversification of scheduling dimensions

As the business becomes more diverse, more scheduling dimensions need to be added, logs being one example. The collector cannot collect logs at an unlimited rate, and log collection is done per node, so the log collection rate needs to be balanced and must not differ too much from node to node. Some services have ordinary CPU usage but very heavy log output, and logs are not part of the default scheduler's decision; when the Pods of several log-heavy services land on the same node, log reporting from that machine may lag.

Our solution: completing the scheduling decision factors

This problem clearly requires us to complete the scheduling decision factors. We extend the predictive scheduling scoring strategy with a log decision factor: logs are treated as a resource of the node, and the log usage of each service, obtained from historical monitoring, is used to calculate the score.
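A minimal sketch of what "logs as a node resource" can look like inside the weighted score, under an assumed per-node log-collection limit and assumed weights; the numbers and field names are illustrative only.

```go
package main

import "fmt"

type NodeUsage struct {
	CPUUtil float64 // predicted peak CPU utilization, 0..1
	LogUtil float64 // predicted peak log rate / node collection limit, 0..1
}

// Illustrative weights; in practice these would come from the regression
// described above.
var weights = struct{ CPU, Log float64 }{CPU: 0.7, Log: 0.3}

// weightedScore: lower predicted utilization on every dimension scores higher,
// so pods with heavy log output spread across nodes instead of stacking up.
func weightedScore(u NodeUsage) float64 {
	return 100 * (weights.CPU*(1-u.CPUUtil) + weights.Log*(1-u.LogUtil))
}

func main() {
	fmt.Printf("quiet node:     %.1f\n", weightedScore(NodeUsage{CPUUtil: 0.4, LogUtil: 0.2}))
	fmt.Printf("log-heavy node: %.1f\n", weightedScore(NodeUsage{CPUUtil: 0.4, LogUtil: 0.9}))
}
```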

Problem 3: Scheduling delay caused by large-scale service scaling

As the business becomes still more complex, peak hours bring a large number of scheduled (cron) tasks and a large amount of elastic scaling. Scheduling large batches (thousands of Pods) at the same time increases scheduling latency, and both kinds of workload are sensitive to scheduling time; for cron tasks in particular, the added delay is clearly perceptible. The root cause is that scheduling a Pod in K8s is an allocation of cluster resources, and the filtering and scoring stages run sequentially; once the cluster reaches a certain scale, large-scale changes therefore lead to perceptible Pod scheduling delays.

Our solution: split out a job scheduler, add concurrent scheduling domains, and schedule in batches

The most direct way to solve low throughput is to turn serial into parallel. For resource-preemption scenarios, refine the resource domains as much as possible and parallelize across them. Based on these strategies, we split out an independent job scheduler and use serverless as the underlying resource for running jobs; for each job Pod, K8s serverless requests a separate Pod to run its sandbox, so the job scheduler is fully parallel. The comparison charts are below:

Native scheduler node CPU usage during evening peak hours

The optimized scheduler's node CPU usage during evening peak hours
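The following Go sketch illustrates the serial-to-parallel idea in its simplest form: pending Pods are grouped by resource domain and each domain is scheduled in its own goroutine. The PendingPod type and scheduleOne are hypothetical placeholders, not our job scheduler's real implementation.

```go
package main

import (
	"fmt"
	"sync"
)

type PendingPod struct {
	Name   string
	Domain string // e.g. "job", "gpu", "online"
}

func scheduleOne(p PendingPod) {
	// In the real system this would run filter/score/bind for one pod
	// (or request a K8s serverless pod to run the job's sandbox).
	fmt.Println("scheduled", p.Name, "in domain", p.Domain)
}

func main() {
	pods := []PendingPod{
		{"cron-1", "job"}, {"cron-2", "job"}, {"train-1", "gpu"}, {"web-1", "online"},
	}

	// Group pods by domain; domains do not contend for the same resources,
	// so they can be scheduled in parallel without conflicting decisions.
	byDomain := map[string][]PendingPod{}
	for _, p := range pods {
		byDomain[p.Domain] = append(byDomain[p.Domain], p)
	}

	var wg sync.WaitGroup
	for domain, ps := range byDomain {
		wg.Add(1)
		go func(domain string, ps []PendingPod) {
			defer wg.Done()
			for _, p := range ps {
				scheduleOne(p)
			}
		}(domain, ps)
	}
	wg.Wait()
}
```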

Summary

Our cluster's heterogeneous resources fall into three resource domains: worker-node resources, GPU resources, and serverless resources. The services running on these three kinds of resources are naturally different, so we use three schedulers, forecast-scheduler, gpu-scheduler, and job-scheduler, to manage Pod scheduling on these three resource domains.

The forecast scheduler manages most of the online business; it expands the resource dimensions and adds the predictive scoring strategy.

The GPU scheduler manages the allocation of GPU machine resources and runs both online inference and offline training, whose ratio shifts over time: during peak hours offline training scales down and online inference scales up, while during off-peak hours offline training scales up and online inference scales down. It also handles some offline image-processing tasks to reuse idle CPU and other resources on the GPU machines.

The job scheduler is responsible for scheduling our cron tasks. These tasks are numerous, are created and destroyed frequently, use resources in a very fragmented way, and have higher timeliness requirements. We therefore schedule such tasks onto serverless as much as possible and compress the redundant machine resources the cluster would otherwise reserve to accommodate a large number of tasks, which improves resource utilization.

Discussion on future evolution

More fine-grained resource domain division

Divide the resource domains at a finer granularity, down to the node level, with locking at the node level.

Resource preemption and rescheduling

In the normal case, when a Pod fails to be scheduled it stays in the Pending state, waiting for a Pod update or a change in cluster resources before being rescheduled. The K8s scheduler also has a preemption capability: when a high-priority Pod fails to schedule, it can evict some low-priority Pods from a node to ensure that the high-priority Pod runs normally. So far we have not used the scheduler's preemption capability; even with the strategies above improving scheduling accuracy, we still cannot avoid the imbalance that the business causes in some scenarios. In such abnormal scenarios a rescheduling capability would come into play, and rescheduling may become our way of automatically repairing abnormal scenarios in the future.
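For reference, here is a highly simplified Go sketch of the preemption idea: when a high-priority Pod cannot be scheduled, look for lower-priority Pods on a node whose eviction would free enough resources. The structures are illustrative only; real kube-scheduler preemption is far more involved (PodDisruptionBudgets, graceful eviction, nominated nodes, and so on).

```go
package main

import "fmt"

type VictimPod struct {
	Name     string
	Priority int32
	CPU      int64 // millicores requested
}

// victimsFor returns the lower-priority pods that would need to be evicted
// from a node to free `needed` millicores, or nil if preemption cannot help.
func victimsFor(needed int64, pendingPriority int32, running []VictimPod) []VictimPod {
	var victims []VictimPod
	var freed int64
	for _, p := range running {
		if p.Priority >= pendingPriority {
			continue // only lower-priority pods may be preempted
		}
		victims = append(victims, p)
		freed += p.CPU
		if freed >= needed {
			return victims
		}
	}
	return nil
}

func main() {
	running := []VictimPod{
		{"batch-1", 100, 500}, {"batch-2", 100, 800}, {"web-1", 10000, 1000},
	}
	if v := victimsFor(1000, 9000, running); v != nil {
		fmt.Println("preempt:", v)
	}
}
```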

