Explain Mesos in simple terms (4): Mesos resource allocation

An important part of what makes Apache Mesos a first-rate data center resource manager is its ability to act like a traffic cop in the face of many different types of applications. This article digs into Mesos resource allocation and discusses how Mesos balances fair resource sharing against the needs of client applications. Before starting, if you have not read the earlier articles in this series, it is recommended that you read them first. The first is an overview of Mesos, the second describes the two-level architecture, and the third covers data storage and fault tolerance.

We'll explore Mesos' resource allocation module to see how it decides which resource offers to send to which frameworks, and how it reclaims resources when necessary. Let's first review Mesos' task scheduling process:

From the description of the two-level architecture in the previous article, we know that for task scheduling the Mesos master first collects information about available resources from the slave nodes, and then offers those resources, in the form of resource offers, to the frameworks registered with it.

Frameworks can accept or reject a resource offer depending on whether it satisfies their tasks' resource constraints. Once a resource offer is accepted, the framework coordinates with the master to schedule tasks and run them on the appropriate slave nodes in the data center.

How resource offers are made is decided by the resource allocation module, which lives in the master. The resource allocation module determines the order in which frameworks are offered resources, while ensuring that resources are shared fairly among frameworks with outstanding demand. In a homogeneous environment, such as a Hadoop cluster, one of the most widely used fair-share allocation algorithms is max-min fairness. Max-min fairness maximizes the minimum allocation given to any user, ensuring that each user receives a fair share of the resources it needs; for a simple illustration of how this works, refer to Example 1 of the max-min fair share algorithm page. As mentioned, this usually works well in a homogeneous environment, where resource demands fluctuate little and involve the same few resource types: CPU, memory, network bandwidth, and I/O. However, resource allocation becomes much harder when resources are scheduled across a data center and resource demands are heterogeneous. For example, what is an appropriate fair-share policy when each of user A's tasks requires 1 CPU and 4 GB of memory, while each of user B's tasks requires 3 CPUs and 1 GB of memory? When user A's tasks are memory-intensive and user B's tasks are CPU-intensive, how can a basket of different resources be divided fairly between them?
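To make max-min fairness concrete, below is a minimal sketch of the progressive-filling idea for a single resource type. The function and the demand numbers are illustrative only, not taken from Mesos:

```python
# A minimal sketch of max-min fairness for one resource type
# (e.g., CPU cores). Illustrative code, not Mesos internals.

def max_min_fair(capacity, demands):
    """Allocate `capacity` among users with the given `demands`:
    no user gets more than it asked for, and the remainder is
    split as evenly as possible (progressive filling)."""
    allocation = [0.0] * len(demands)
    unsatisfied = set(range(len(demands)))
    remaining = float(capacity)
    while unsatisfied and remaining > 1e-9:
        # Split what is left evenly among users still wanting more.
        share = remaining / len(unsatisfied)
        for i in list(unsatisfied):
            grant = min(share, demands[i] - allocation[i])
            allocation[i] += grant
            remaining -= grant
            if allocation[i] >= demands[i]:
                unsatisfied.remove(i)
    return allocation

# Four users demand 2, 2.6, 4, and 5 units of a 10-unit resource.
# Result converges to [2, 2.6, 2.7, 2.7] (up to float rounding):
# small demands are fully met, large ones split the remainder evenly.
print(max_min_fair(10, [2, 2.6, 4, 5]))
```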

Because managing resources in heterogeneous environments is precisely Mesos' specialty, it implements a pluggable resource allocation module architecture, letting users choose the allocation policy and algorithm best suited to a particular deployment. For example, a user could implement a weighted max-min fairness algorithm that lets a specified framework obtain more resources than the others. By default, Mesos includes a strict-priority resource allocation module and a modified fair-share module. The strict-priority module assigns each framework a priority, so that the highest-priority framework always receives and accepts resource offers sufficient to satisfy its task requirements. This guarantees resources for critical applications at the cost of restricting dynamic resource sharing in Mesos, and it can potentially starve other frameworks.

For these reasons, most users stick with the default: DRF (Dominant Resource Fairness), the modified fair-share algorithm in Mesos that is better suited to heterogeneous environments.

Like Mesos itself, DRF comes from the Berkeley AMPLab team and is implemented as Mesos' default resource allocation policy.

Readers can find DRF's original paper here and here. In this article I will summarize the main points and provide some examples that I believe will make DRF clearer. Let's start the demystification.

The goal of DRF is to ensure that every user (that is, every framework in Mesos) receives a fair share of the resource it needs most in a heterogeneous environment. To grasp DRF, we need two concepts: dominant resource and dominant share. A framework's dominant resource is the resource type (CPU, memory, and so on) of which its tasks require the largest fraction of the available total. For example, the dominant resource of a framework running compute-intensive tasks is CPU, while the dominant resource of a framework whose tasks rely on in-memory computation is memory. Since resources are allocated to frameworks, DRF tracks each framework's share of each resource type; the largest of these shares is the framework's dominant share. The DRF algorithm computes the dominant share of every registered framework and uses it to ensure that each framework receives a fair share of its dominant resource.

Is this too abstract? Let's illustrate with an example. Suppose we have a resource offer of 9 CPUs and 18 GB of memory. Each of Framework 1's tasks requires (1 CPU, 4 GB memory), and each of Framework 2's tasks requires (3 CPUs, 1 GB memory). Each Framework 1 task consumes 1/9 of the total CPU and 2/9 of the total memory, so Framework 1's dominant resource is memory. Similarly, each Framework 2 task consumes 1/3 of the total CPU and 1/18 of the total memory, so Framework 2's dominant resource is CPU. DRF will try to give each framework an equal dominant share. In this example, DRF works with the frameworks to allocate as follows: Framework 1 runs three tasks for a total of (3 CPUs, 12 GB memory), and Framework 2 runs two tasks for a total of (6 CPUs, 2 GB memory).

At this point each framework's dominant resource (memory for Framework 1, CPU for Framework 2) ends up at the same dominant share (2/3, or 67%), and there are not enough resources left for either framework to run another task. Note that if Framework 1 only needed to run two tasks, then Framework 2 and any other registered frameworks would receive all the remaining resources.
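A few lines of Python can verify the arithmetic of this example. The helpers below are illustrative only, not Mesos' allocator code:

```python
# Checking the dominant-share arithmetic of the example above.

TOTAL = {"cpu": 9.0, "mem": 18.0}  # the resource offer

def dominant_share(num_tasks, demand):
    """Return (dominant resource, its share) for `num_tasks`
    copies of a task with per-task `demand`."""
    shares = {r: num_tasks * demand[r] / TOTAL[r] for r in TOTAL}
    dominant = max(shares, key=shares.get)
    return dominant, shares[dominant]

# Framework 1: three (1 CPU, 4 GB) tasks -> ('mem', ~0.67), i.e. 12/18
print(dominant_share(3, {"cpu": 1, "mem": 4}))
# Framework 2: two (3 CPU, 1 GB) tasks  -> ('cpu', ~0.67), i.e. 6/9
print(dominant_share(2, {"cpu": 3, "mem": 1}))
```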

So how does DRF arrive at this result? As mentioned earlier, the DRF allocation module tracks the resources allocated to each framework and each framework's dominant share. At each step, DRF sends a resource offer to the framework with the lowest dominant share among all frameworks with tasks to run. The framework accepts the offer if there are enough available resources to run one of its tasks. Let's walk through each step of the DRF algorithm using the example from the DRF paper cited earlier. For simplicity, we will not consider resources being released back into the pool when short tasks complete; we assume every framework has an unbounded number of tasks to run and that every resource offer is accepted.

Recalling the example above, suppose there is a resource offer of 9 CPUs and 18 GB of memory. Each of Framework 1's tasks requires (1 CPU, 4 GB memory), and each of Framework 2's tasks requires (3 CPUs, 1 GB memory). Each Framework 1 task consumes 1/9 of the total CPU and 2/9 of the total memory, so Framework 1's dominant resource is memory. Similarly, each Framework 2 task consumes 1/3 of the total CPU and 1/18 of the total memory, so Framework 2's dominant resource is CPU. The allocation then proceeds as follows:

Framework chosen | Resource Shares     | Dominant Share | Dominant Share % | CPU Total Allocation | RAM Total Allocation
Framework 2      | 3/9 CPU, 1/18 RAM   | 3/9 (CPU)      | 33%              | 3                    | 1
Framework 1      | 1/9 CPU, 4/18 RAM   | 4/18 (RAM)     | 22%              | 4                    | 5
Framework 1      | 2/9 CPU, 8/18 RAM   | 8/18 (RAM)     | 44%              | 5                    | 9
Framework 2      | 6/9 CPU, 2/18 RAM   | 6/9 (CPU)      | 67%              | 8                    | 10
Framework 1      | 3/9 CPU, 12/18 RAM  | 12/18 (RAM)    | 67%              | 9                    | 14

Each row in the table above provides the following information:

  • Framework chosen - the framework that received the latest resource offer.
  • Resource Shares - the fraction of each resource type (CPU and memory) held by the framework at that point in time, relative to the total resources.
  • Dominant Share - the framework's share of its dominant resource at that point, as a fraction of the total.
  • Dominant Share % - the same value expressed as a percentage of the total.
  • CPU Total Allocation - the total CPUs accepted by all frameworks at that point.
  • RAM Total Allocation - the total memory accepted by all frameworks at that point.

Note that the framework chosen in each row is the one whose dominant share was lowest before that offer was made.

Initially, both frameworks have a dominant share of 0%. We assume DRF chooses Framework 2 first; we could equally assume Framework 1, and the end result would be the same.

  1. Framework 2 receives an offer and runs a task; its dominant resource is CPU, and its dominant share rises to 33%.
  2. Since Framework 1's dominant share is still 0%, it receives the next offer and runs a task; its dominant share (memory) rises to 22%.
  3. Framework 1 still has the lower dominant share, so it takes the next offer and runs another task, raising its dominant share to 44%.
  4. DRF then sends a resource offer to Framework 2, which now has the lower dominant share.
  5. The process continues until no new task can be run with the available resources; in this case, CPU is saturated.
  6. The process then repeats with a new set of resource offers.
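To tie the walkthrough together, here is a minimal sketch of the DRF decision loop that replays the trace above under the same simplifying assumptions (every offer is accepted, no task ever finishes). The data structures are illustrative, not Mesos internals:

```python
# A minimal sketch of the DRF decision loop from the walkthrough above.
# Illustrative code only, not Mesos' actual allocator.

TOTAL = {"cpu": 9.0, "mem": 18.0}

frameworks = {
    "Framework 1": {"demand": {"cpu": 1, "mem": 4}, "alloc": {"cpu": 0, "mem": 0}},
    "Framework 2": {"demand": {"cpu": 3, "mem": 1}, "alloc": {"cpu": 0, "mem": 0}},
}

def dominant_share(fw):
    """The largest fraction of any resource type held by a framework."""
    return max(fw["alloc"][r] / TOTAL[r] for r in TOTAL)

def fits(fw):
    """Whether one more of this framework's tasks fits in what is left."""
    used = {r: sum(f["alloc"][r] for f in frameworks.values()) for r in TOTAL}
    return all(used[r] + fw["demand"][r] <= TOTAL[r] for r in TOTAL)

while True:
    # Offer resources in order of lowest dominant share; the tie-break
    # picks Framework 2 first, matching step 1 of the walkthrough.
    ranked = sorted(frameworks,
                    key=lambda n: (dominant_share(frameworks[n]),
                                   n == "Framework 1"))
    runnable = [n for n in ranked if fits(frameworks[n])]
    if not runnable:
        break  # resources saturated: no framework can run another task
    fw = frameworks[runnable[0]]
    for r in TOTAL:
        fw["alloc"][r] += fw["demand"][r]
    print(runnable[0], "runs a task; dominant shares:",
          {n: round(dominant_share(frameworks[n]), 2) for n in frameworks})

for name, fw in frameworks.items():
    print(name, "total allocation:", fw["alloc"])
```

Running this prints the same sequence as the table: Framework 2, then Framework 1 twice, then Framework 2, then Framework 1, ending with (3 CPUs, 12 GB) for Framework 1 and (6 CPUs, 2 GB) for Framework 2.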

Note that it is possible to create a resource allocation module that uses weighted DRF to favor a particular framework or group of frameworks. And, as mentioned earlier, custom modules can be created to provide organization-specific allocation policies.
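Building on the DRF sketch above, weighted DRF can be thought of as ranking frameworks by dominant share divided by weight rather than by raw dominant share. The weights below are made-up values for illustration, not a real Mesos configuration:

```python
# Weighted DRF as a variation on the sketch above (reuses its
# dominant_share helper). Illustrative weights, not Mesos config.

WEIGHTS = {"Framework 1": 2.0, "Framework 2": 1.0}

def weighted_dominant_share(name, fw):
    # Dividing by the weight makes a heavily weighted framework look
    # under-served, so it keeps winning offers until it holds
    # proportionally more of its dominant resource.
    return dominant_share(fw) / WEIGHTS[name]
```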

In general, most tasks these days are short-lived, so Mesos can simply wait for a task to finish and reallocate its resources. However, a cluster can also fill up with long-running tasks belonging to hung jobs or misbehaving frameworks.

It is worth noting that the resource allocation module has the ability to revoke tasks when resources are not being freed quickly enough. Mesos attempts to revoke a task by asking its executor to kill the task, giving the executor a grace period to clean up. If the executor does not respond to the request, the allocation module kills the executor and all of its tasks.

An allocation policy can be implemented to protect specified tasks from revocation by granting the framework a guaranteed allocation. As long as a framework holds less than its guaranteed allocation, Mesos will not kill that framework's tasks.
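The general shape of that revocation flow might look like the sketch below. The executor object and its methods are hypothetical stand-ins used for illustration, not Mesos' actual interfaces:

```python
# A sketch of the revocation pattern described above: ask the executor
# to kill a task, allow a grace period for cleanup, then destroy the
# executor if it does not comply. All names here are hypothetical.

import time

GRACE_PERIOD_SECS = 5.0  # illustrative grace period

def revoke_task(executor, task_id):
    executor.kill_task(task_id)  # polite request; executor may clean up
    deadline = time.time() + GRACE_PERIOD_SECS
    while time.time() < deadline:
        if not executor.is_running(task_id):
            return  # the task exited within the grace period
        time.sleep(0.1)
    # The executor did not respond in time: kill it and all its tasks.
    executor.destroy()
```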

There is more to learn about Mesos resource allocation, but I'll stop here. Next, I'm going to do something different and talk about the Mesos community. I believe this is an important topic to consider, because open source is about community as much as technology.

After the community article, I will write some step-by-step tutorials on installing Mesos and on creating and using frameworks. After those hands-on articles, I will come back to more in-depth topics, such as how a framework interacts with the master and how Mesos works across multiple data centers.

As always, I encourage readers to provide feedback, especially to point out anything you think I got wrong. I am not omniscient, and I ask for advice with humility, so I very much look forward to readers' corrections and insights. We can also communicate on Twitter: follow @hui_kenneth.


http://www.infoq.com/cn/articles/analyse-mesos-part-04/
