Yarn Schedulers

Ideally, every request an application makes for Yarn resources would be met immediately. In reality, resources are limited, especially in a busy cluster, so an application's resource request often has to wait before the appropriate resources become available. In Yarn, the component responsible for allocating resources to applications is the Scheduler. Scheduling is a hard problem in its own right, and it is difficult to find a single policy that suits every scenario. For this reason, Yarn provides a variety of configurable scheduling policies for us to choose from.

There are three schedulers to choose from in Yarn: the FIFO Scheduler, the Capacity Scheduler, and the Fair Scheduler.

FIFO Scheduler

The FIFO Scheduler arranges applications into a queue in the order they are submitted; it is a first-in, first-out queue. When allocating resources, the scheduler first satisfies the resource requirements of the application at the head of the queue, then serves the next application once the first one's requirements are met, and so on.

The FIFO Scheduler is the simplest and easiest to understand and requires no configuration, but it is not suitable for shared clusters: a large application can consume all cluster resources, blocking every other application. In a shared cluster, the Capacity Scheduler or the Fair Scheduler is a better fit, since both allow large jobs and small jobs submitted at the same time to obtain a share of system resources.

Capacity Scheduler

The Capacity Scheduler allows multiple organizations to share the whole cluster, with each organization obtaining a portion of the cluster's computing capacity. By assigning a dedicated queue to each organization and then giving each queue a share of the cluster's resources, the cluster can serve multiple organizations through multiple queues. In addition, a queue can be subdivided internally, so that members within an organization can share that queue's resources; within a single queue, resources are scheduled using a first-in, first-out (FIFO) policy.

The Capacity Scheduler was originally designed and developed by Yahoo to let Hadoop be used by multiple users while maximizing the throughput of the whole cluster; it is now used by IBM BigInsights and Hortonworks HDP.

The Capacity Scheduler is designed to let applications share cluster resources in a predictable and simple way, through "job queues". It allocates the available resources to jobs according to the tenants' needs and requirements, and it allows an application to use resources that are not currently in use while still guaranteeing each queue the share of resources it is entitled to. Administrators control the capacity of each queue; the Capacity Scheduler is responsible for scheduling the jobs submitted to each queue.

Fair Scheduler

With the Fair Scheduler we do not need to reserve any system resources in advance; the Fair Scheduler dynamically adjusts the resources allotted to all running jobs. For example, when a first, large job is submitted and is the only one running, it obtains all of the cluster's resources; when a second, small job is then submitted, the Fair Scheduler allocates half of the resources to that small job, so that the two jobs share the cluster fairly.

Note that there can be some delay between the submission of the second job and when it obtains resources, because it has to wait for the first job to release the Containers it occupies. After the small job completes and releases the resources it occupied, the large job again obtains all of the system's resources. The net effect is that the Fair Scheduler achieves high resource utilization while also ensuring that small jobs complete in a timely fashion.

The Fair Scheduler was originally designed and developed by Facebook to let Hadoop cluster resources be shared fairly by multiple users; it is now used by Cloudera CDH.

The Fair Scheduler does not require cluster resources to be reserved in advance, because it dynamically balances resources among all running jobs.
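Although the worked example below uses the Capacity Scheduler, it may help to see the Fair Scheduler's equivalent. It is configured through an allocation file (by default fair-scheduler.xml, located via the yarn.scheduler.fair.allocation.file property). The following is a minimal sketch; the queue names and weights are illustrative assumptions, not part of this article's example:

<?xml version="1.0"?>
<allocations>
  <!-- Two queues sharing the cluster in a 40:60 ratio by weight -->
  <queue name="prod">
    <weight>40</weight>
  </queue>
  <queue name="dev">
    <weight>60</weight>
    <!-- dev is subdivided into two sub-queues -->
    <queue name="mapreduce"/>
    <queue name="spark"/>
  </queue>
</allocations>

Unlike the Capacity Scheduler's capacity percentages, weights are relative shares, so the 40:60 ratio above behaves much like the 40%/60% split in the Capacity Scheduler example below.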

 

Example: configuring the Capacity Scheduler

The scheduler is selected through the yarn.resourcemanager.scheduler.class parameter in the yarn-site.xml configuration file; the default is the Capacity Scheduler.
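Concretely, the property looks like this in yarn-site.xml. The value shown is the stock CapacityScheduler class that ships with Hadoop; substituting org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler or org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler selects the Fair or FIFO Scheduler instead:

<property>
  <!-- Which scheduler implementation the ResourceManager uses -->
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>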

Suppose we have the following queue hierarchy:

root
├── prod
└── dev
    ├── mapreduce
    └── spark

The following is a simple Capacity Scheduler configuration file, named capacity-scheduler.xml. In this configuration we define two sub-queues, prod and dev, under the root queue, with capacities of 40% and 60% respectively. Note that a queue is configured through properties of the form yarn.scheduler.capacity.<queue-path>.<sub-property>, where <queue-path> is the queue's path in the inheritance tree, such as root.prod, and <sub-property> is usually capacity or maximum-capacity.

<configuration>
  <!-- Two top-level queues under root -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>prod,dev</value>
  </property>
  <!-- dev is subdivided into two sub-queues -->
  <property>
    <name>yarn.scheduler.capacity.root.dev.queues</name>
    <value>mapreduce,spark</value>
  </property>
  <!-- prod gets 40% of the cluster, dev gets 60% -->
  <property>
    <name>yarn.scheduler.capacity.root.prod.capacity</name>
    <value>40</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>60</value>
  </property>
  <!-- dev may grow to at most 75% of the cluster -->
  <property>
    <name>yarn.scheduler.capacity.root.dev.maximum-capacity</name>
    <value>75</value>
  </property>
  <!-- Within dev, mapreduce and spark each get half -->
  <property>
    <name>yarn.scheduler.capacity.root.dev.mapreduce.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.spark.capacity</name>
    <value>50</value>
  </property>
</configuration>

As we can see, the dev queue has been divided into two sub-queues, mapreduce and spark, of equal capacity. dev's maximum-capacity property is set to 75%, so even when the prod queue is completely idle, dev cannot occupy all cluster resources; in other words, prod always has at least 25% of the cluster available for emergency use. Note that mapreduce and spark do not set a maximum-capacity property, which means a job in either queue may use all of the dev queue's resources (up to 75% of the cluster). Similarly, since prod sets no maximum-capacity, it may occupy the entire cluster.

Which queue to use depends on the specific application. In MapReduce, for example, we can specify the queue through the mapreduce.job.queuename property. If the queue does not exist, we receive an error when submitting the job. If we do not specify any queue, applications are placed in a queue named default.

Note: for the Capacity Scheduler, the queue name must be the last component of the queue path; a full queue path is not recognized. For example, with the configuration above, using prod or mapreduce as the queue name works, but root.dev.mapreduce or dev.mapreduce is invalid.
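Putting the two notes above together, a sketch of submitting a MapReduce job to the mapreduce sub-queue of dev would set the property as follows (it can equally be passed on the command line as -Dmapreduce.job.queuename=mapreduce when submitting the job):

<property>
  <!-- Use only the leaf queue name; "root.dev.mapreduce" or "dev.mapreduce" would be rejected -->
  <name>mapreduce.job.queuename</name>
  <value>mapreduce</value>
</property>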

 

 

 

 
