On the principles of distributed cluster resource management systems

This article first appeared on my personal public account: TechFlow. Original writing is not easy; a follow is much appreciated.


Today's article is the twelfth in the distributed systems series. Let's continue looking at cluster resource management systems.

In the last article we briefly covered what distributed cluster resource management is, the background of its birth, the problems it solves, and its advantages and disadvantages. That discussion stayed fairly shallow and did not go deep into principles, so in this article let's look at how cluster management systems actually work.

If you have forgotten the previous article, or if you are a new follower, you can click the link below to review it.

On the principle of distributed cluster resource management system


Locality first


In big data applications there is a basic design principle: we usually send the computation to the node that stores the data, rather than pull the data to the node that runs the computation. The reasoning is easy to see: it minimizes network communication between nodes and reduces data transfer. Keep in mind that data in big data scenarios is very large, often tens to hundreds of GB and frequently at TB or even PB scale. Once that data has to cross the network, the overhead is considerable.

We can summarize this as the "locality first" principle: the more local the node executing a task, the better. Ideally everything runs on a single physical machine, because that avoids data transfer entirely.

We can roughly measure the quality of a cluster scheduling system against this principle, and we can list three levels of locality. The first is node locality: all computation happens on one node, so all data transfer is avoided. This is the best case. The next level is rack locality: the computation cannot fit on one node, but at least all the executing machines sit in the same rack. Machines within a rack can transfer data internally without going through the external network, and such transfers are much faster. The worst case is that the nodes are scattered across different racks and no acceleration is possible; the article calls this global locality, and it brings a lot of overhead.
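To make the idea concrete, here is a minimal, hypothetical sketch of locality-aware placement. The cluster model (a `rack_of` map from node to rack, a list of free nodes, and the set of nodes holding the task's data) is an assumption invented for illustration, not the API of any real scheduler.

```python
def pick_node(data_nodes, free_nodes, rack_of):
    """Prefer node locality, then rack locality, then any free node."""
    # 1. Node locality: a free node that already holds the task's data.
    for node in free_nodes:
        if node in data_nodes:
            return node, "node-local"
    # 2. Rack locality: a free node in the same rack as some data replica.
    data_racks = {rack_of[n] for n in data_nodes}
    for node in free_nodes:
        if rack_of[node] in data_racks:
            return node, "rack-local"
    # 3. No locality: fall back to any free node (cross-rack transfer needed).
    return free_nodes[0], "off-rack"


if __name__ == "__main__":
    rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
    print(pick_node(data_nodes={"n2"}, free_nodes=["n1", "n3"], rack_of=rack_of))
    # -> ('n1', 'rack-local'): n2 holds the data but is busy, n1 shares its rack.
```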

We all know that a good cluster scheduling system should "squeeze" as much performance as possible out of every machine in the cluster, yet the compute a given job consumes is essentially fixed. So besides machine utilization, locality is the other key lever, and sometimes the overhead of network IO hurts more than low machine utilization does.


Resource allocation details


Our intuitive picture of resource allocation is very simple: when a machine is idle we assign it a task, and when the task finishes we reclaim the resources. In practice, however, there are many details that require careful design and thought; a small oversight can have serious consequences.

Let's look at two questions. The first: what should we do when a new task arrives but there are not enough free resources?

Two strategies come to mind easily. The first is to do nothing and wait: once some running tasks finish and release their resources, we run the new task. But what if the new task is very urgent? The tasks currently running may have low priority yet hold their resources for a very long time. Do we wait forever?

So we think of a second strategy: preemption. Since the new task has high priority and some running tasks have low priority, we can take resources away from the low-priority tasks first, run the important task, and only then resume the low-priority work. Unfortunately, this is problematic. Setting the technical issues aside for now (we will come back to them later), think about it: who would willingly give their own tasks a lower priority? Everyone is sure their tasks are the most important. Before long the priority field becomes purely decorative, and everything runs at the highest priority.
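For illustration only, here is a rough sketch of such a preemption policy. The `Task` shape, the CPU numbers, and the rule of evicting only strictly lower-priority tasks are assumptions chosen for this example, not how any particular scheduler implements it.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    priority: int   # higher number = more important
    cpus: int

def plan_preemption(new_task, running, free_cpus):
    """Return the running tasks to evict so that `new_task` fits, or None."""
    if free_cpus >= new_task.cpus:
        return []                      # enough room, nobody gets preempted
    victims = []
    # Consider only strictly lower-priority tasks, lowest priority first.
    for t in sorted(running, key=lambda t: t.priority):
        if t.priority >= new_task.priority:
            break
        victims.append(t)
        free_cpus += t.cpus
        if free_cpus >= new_task.cpus:
            return victims
    return None                        # cannot free enough: wait instead

running = [Task("etl", 1, 8), Task("report", 2, 4), Task("serving", 9, 4)]
print(plan_preemption(Task("urgent", 8, 10), running, free_cpus=2))
# -> evicts "etl"; "serving" outranks the new task and is never touched.
```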

There is a follow-up to the question above, but let's set it aside and look at another question derived from it. Different tasks need resources in different ways. Some tasks can run with however much they get, such as Spark or MapReduce jobs: more resources means more machines and a shorter run, fewer resources means a longer run, but the job finishes either way, and only the execution time changes. Other tasks are not like this. A machine learning job, for example, may need a large amount of resources all at once, and nothing less will do. So here is the question: when a new task arrives and the currently free resources cannot satisfy its full demand, do we hold off and allocate everything in one shot once enough is free, or do we allocate a little now and keep adding as more resources free up?
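The two allocation styles can be contrasted with a toy sketch. The `Pool` class below is an invented model: `allocate_incremental` suits elastic Spark/MapReduce-style jobs, while `allocate_gang` grants the full request or nothing, which is the safer choice for rigid machine learning jobs.

```python
class Pool:
    def __init__(self, total):
        self.free = total

    def allocate_incremental(self, wanted):
        """Give the job whatever is available right now, up to `wanted`."""
        granted = min(wanted, self.free)
        self.free -= granted
        return granted

    def allocate_gang(self, wanted):
        """Grant either the full request or nothing at all."""
        if self.free >= wanted:
            self.free -= wanted
            return wanted
        return 0

pool = Pool(total=100)
print(pool.allocate_incremental(80))  # elastic job: 80 granted, runs right away
print(pool.allocate_gang(60))         # rigid job: only 20 left -> 0, must wait
```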

As you can see, the seemingly bland allocation strategy hides many tricky details. This is exactly why I said that current cluster resource management systems are far from mature and have only just gotten started: there is still no particularly good solution to the problems above.


Starvation and deadlock


Remember the two questions above? If they are not handled well, starvation and deadlock will follow.

Starvation means a task can never get scheduled, for example because priorities are set unreasonably. You, the honest one, give a normal priority to a task you consider not especially important, while the old hands all mark their tasks as highest priority. Since high-priority tasks keep being submitted, your task keeps being pushed back; you expect results soon, but it may not run until after working hours. At that point you either go along with the crowd and set all your tasks to the highest priority, or you keep waiting forever, or you face pressure from your performance review and your boss because the work is not getting done.

So a simple scheduling problem escalates into a test of personal values, a classic case of bad money driving out good: because a few people ignore the rules, those who follow them are punished. The tears of working men and women are real [dog head].

Deadlock is easier to understand; anyone who has studied operating systems will find the principle familiar. Suppose we have two tasks, A and B, and each needs 2/3 of the cluster's resources to run. They are submitted at almost the same time, and the system, following first come first served, allocates 1/2 of the resources to each. This creates a deadlock: neither task can start, and neither will release its resources, so unless one of them is killed manually, the situation persists indefinitely.

At present there seems to be no perfect way to completely avoid these two situations. Architects can only make trade-offs based on the actual workloads of their own clusters and add some human intervention, such as agreeing on scheduling rules and conventions within the team. In other words, to some extent this is not purely a systems problem but a joint problem of system design and team coordination.


Schedulers


Let's look at the architectures of common schedulers. There are three: the centralized scheduler, the two-level scheduler, and the shared-state scheduler.


Centralized scheduler

Let's first look at the centralized scheduler, which is the most intuitive and simplest design.

Its design is centralized: there is only one global central scheduler in the entire system, and every framework and computing task is scheduled by it. It is a bit like the clan system of feudal times, where one patriarch manages every matter, big or small, for the whole extended family. Obviously this has many drawbacks; the two problems discussed above, for instance, are hard to solve without human intervention.

Later, improvements were made on this basis by adding branch logic inside the central scheduler. This variant is called a multi-path scheduler.

Overall little has changed; there is just one more conditional branch. In other words, the scheduler internally applies different scheduling and allocation policies to different types of tasks. For example, small fragmented tasks might be handled with priority management and a first-come-first-served policy, while a large machine learning job runs only once it can get its complete resource allotment, to prevent deadlock, and so on.
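Conceptually, the multi-path scheduler is just a dispatch on task type. The sketch below is a hypothetical illustration; the task fields and policy functions are made up, but it shows how one central entry point can route small batch tasks to a FIFO policy and large ML jobs to an all-or-nothing policy.

```python
def schedule(task, cluster):
    """Route each task to a policy chosen by its type."""
    if task["type"] == "batch_small":
        # Small fragmented tasks: simple FIFO within priority classes.
        return fifo_by_priority(task, cluster)
    elif task["type"] == "ml_large":
        # Large ML jobs: only start when the full allotment is available,
        # to avoid the partial-allocation deadlock described earlier.
        return all_or_nothing(task, cluster)
    else:
        return default_policy(task, cluster)

def fifo_by_priority(task, cluster):
    return f"queue {task['name']} behind same-priority peers"

def all_or_nothing(task, cluster):
    ok = cluster["free_cpus"] >= task["cpus"]
    return f"start {task['name']}" if ok else f"hold {task['name']} until full fit"

def default_policy(task, cluster):
    return f"best-effort placement for {task['name']}"

print(schedule({"type": "ml_large", "name": "train", "cpus": 64},
               {"free_cpus": 32}))   # -> hold train until full fit
```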

Compared with the single-path centralized scheduler, the multi-path version adds some flexibility, but its scalability is still limited, its concurrency is relatively poor, and resource utilization is not high enough; at large scale, scheduling performance easily becomes a bottleneck. On the other hand, it is structurally simple and easy to maintain.


Two-level scheduler

Because the centralized scheduler has many problems and is not flexible enough, another layer of structure was added on top of it to increase flexibility.

There is still a central scheduler in overall command, but it no longer schedules tasks directly. Instead, it hands out the cluster's resources to framework schedulers using a relatively coarse-grained policy. The logic for scheduling and executing tasks lives in the framework schedulers, whose policies are more fine-grained than the central scheduler's.

In addition, only the central scheduler can see all of the resources in the cluster; each framework scheduler can see only the share it has been allocated. The familiar YARN and Mesos both use this architecture.

With framework schedulers, different frameworks can apply different policies, which helps improve the concurrency and resource utilization of the whole cluster. Overall, the two-level scheduler performs much better than the centralized one.

But even this is not perfect, because the central scheduler uses a pessimistic concurrency strategy when allocating. Simply put, while handing out resources it strictly follows a predetermined order and locks resources to prevent conflicts between frameworks requesting them. Since pessimistic locking is used, overall performance clearly suffers.
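The following sketch models this two-level flow under invented names: a central scheduler serializes coarse-grained offers behind a lock (the pessimistic part), and each framework scheduler decides on its own how to place tasks inside whatever it was granted. It mirrors the Mesos-style offer idea in spirit only.

```python
import threading

class CentralScheduler:
    def __init__(self, free_cpus):
        self.free_cpus = free_cpus
        self._lock = threading.Lock()      # pessimistic: serialize offers

    def offer(self, framework, cpus):
        with self._lock:                   # resources are locked while offered
            granted = min(cpus, self.free_cpus)
            self.free_cpus -= granted
        # The framework does its own fine-grained task placement internally.
        return framework.accept(granted)

class FrameworkScheduler:
    def __init__(self, name):
        self.name = name

    def accept(self, cpus):
        # Fine-grained policy lives here (per-task placement, locality, ...).
        return f"{self.name} schedules its tasks on {cpus} cpus"

central = CentralScheduler(free_cpus=100)
print(central.offer(FrameworkScheduler("spark"), cpus=60))   # 60 cpus granted
print(central.offer(FrameworkScheduler("ml"), cpus=60))      # only 40 remain
```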


Shared-state scheduler

The architecture of the shared-state scheduler is very close to the two-level scheduler; it can be roughly understood as the result of removing the central scheduler.

This architecture first appeared in Google's Omega scheduling system, a predecessor of the now-popular Kubernetes. Its biggest difference from the two-level scheduler is that there is no central scheduler: every framework scheduler can see all of the resources in the entire cluster, and when resources are needed, the framework schedulers compete with each other to obtain them.

Also, unlike the central scheduler, the shared-state strategy uses optimistic locking. To briefly explain the difference between optimistic and pessimistic locking: pessimistic locking assumes the worst case. For example, it assumes that after we acquire a resource, another thread may access or modify it before we are done, so we take a lock to prevent exactly that.

Optimistic locking is the opposite: it assumes optimistically that the system can run smoothly without resource contention. In other words, execute first; if preemption or a conflict does occur, resolve it afterwards through other mechanisms such as retrying.

Even under high concurrency, resource conflicts are a relatively low-probability event. Using pessimistic locking would clearly add a lot of locking overhead, so a design based on optimistic locking gives the system stronger concurrent performance. But this comes at a price, and optimistic locking is not perfect: if heavy contention does occur, the losing side often has to retry its task, which brings a lot of unnecessary overhead and wastes resources.
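Here is a toy model of the optimistic approach, not the actual Omega or Kubernetes implementation: each framework reads a snapshot of the shared state together with a version number, and a claim commits only if the version is unchanged; otherwise the framework re-reads and retries.

```python
class SharedClusterState:
    def __init__(self, free_cpus):
        self.free_cpus = free_cpus
        self.version = 0

    def snapshot(self):
        return self.free_cpus, self.version

    def try_commit(self, claim_cpus, seen_version):
        # Commit only if the state is unchanged since the framework's read
        # and the claim still fits (compare-and-swap style check).
        if seen_version != self.version or claim_cpus > self.free_cpus:
            return False               # conflict: caller must re-read and retry
        self.free_cpus -= claim_cpus
        self.version += 1
        return True

state = SharedClusterState(free_cpus=100)
free, ver = state.snapshot()           # framework A reads the full state
state.try_commit(70, ver)              # A commits first and wins
print(state.try_commit(50, ver))       # B used a stale version -> False, retry
free, ver = state.snapshot()
print(state.try_commit(30, ver))       # B retries against fresh state -> True
```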

In addition, since preemption among frameworks is unconstrained, it may happen that high-priority frameworks keep grabbing resources while low-priority tasks starve. Under this mechanism there is no way to guarantee fairness among tasks; this is the unavoidable consequence of weakening the central scheduler.


Summary


Reviewing the three designs above, we find that their evolutionary order is really the order in which the central scheduler is weakened. This is easy to understand: a powerful central scheduler can preserve fairness across the whole cluster, but because of its low efficiency it easily becomes the cluster's bottleneck. The weaker the central scheduler, the more freedom the framework schedulers have and the more flexible the whole system's scheduling becomes, which usually means better performance.

Some people draw an analogy: the centralized scheduler is a bit like a planned economy, where everything is arranged by the state; fairness can be guaranteed, but freedom and flexibility are poor, and the whole country runs and develops inefficiently. The two-level scheduler is like a mixed model of big government and a small market: state intervention is still strong, but there is a bit more freedom. The shared-state scheduler is the free-competition model of small government and a big market: state intervention has nearly disappeared, becoming an invisible hand, which further improves flexibility and operating efficiency. But with less intervention, when risk arrives it may also cause big problems.

To some extent, just as no country's social system is perfect, no cluster scheduling strategy is perfect today. Each has its own strengths and best-fit scenarios, and none is the universally best solution, which is why we study the underlying principles rather than stopping at how to use the tools.

That's all for today's article. If you feel you have gained something, please follow or share it; your support means a lot to me.




Origin blog.csdn.net/TechFlow/article/details/105451337