Optimizing parallel processing in the Yarn Capacity Scheduler

0 Preface

Yarn's default scheduler is the Capacity Scheduler, and by default it has only one queue, named default. If there are not enough resources to run the first task in the queue, the second task will not be executed; it has to wait until the first task completes.
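Which scheduler is in use is controlled by yarn.resourcemanager.scheduler.class in yarn-site.xml. In Hadoop 3.x the Capacity Scheduler is already the default, so the following snippet only makes that default explicit:

<property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>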

1 Experiment

(1) Start a hive client and execute the following SQL statement to insert data.

hive (default)> insert into table student values(1,'abc');

When this statement is executed, Hive initializes a Spark session to run the Hive on Spark task. Since no queue is specified, the Spark session occupies the default queue by default, and it keeps the queue occupied until the Hive client is exited.
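If you do not want the Hive session to occupy the default queue, the queue can be set inside the session before the statement runs. A minimal sketch, assuming the hive queue configured later in this article already exists, and assuming that Hive on Spark forwards spark.* settings to its Spark session (the first line targets Hive's MR jobs, the second the Hive on Spark session):

hive (default)> set mapreduce.job.queuename=hive;

hive (default)> set spark.yarn.queue=hive;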

You can visit the web page of ResourceManager to view related information.
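The same information is available from the command line; for example, list the currently running applications with the Yarn CLI (the web UI itself is usually at http://<ResourceManager host>:8088, which in this article's cluster would be hadoop103):

[root@hadoop102 ~]$ yarn application -list -appStates RUNNING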

(2) With the Hive client still open, submit a MapReduce job.

[root@hadoop102 ~]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar pi 1 1

The MR job does not specify a queue either, so it is also submitted to the default queue by default. Because the parallelism of a single queue in the Capacity Scheduler is 1, the MR job submitted later will wait indefinitely and cannot start executing.
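The queue's state and usage can also be confirmed from the command line, for example:

[root@hadoop102 ~]$ yarn queue -status default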

(Screenshot: the task submission console output.)

(Screenshot: the ResourceManager web page.)

(3) In the default queue of the Capacity Scheduler, only one task runs at a time, so concurrency is low. How can this be solved?

Solution 1: Increase the proportion of resources that ApplicationMasters may use, thereby increasing the number of applications that can run concurrently.

Solution 2: Create multiple queues, for example add a hive queue.

2 Increase the proportion of ApplicationMaster resources

To address the low concurrency of the Capacity Scheduler, consider adjusting the parameter yarn.scheduler.capacity.maximum-am-resource-percent. Its default value is 0.1, which is the maximum proportion of cluster resources that ApplicationMasters may use; its purpose is to limit the number of applications that are active at the same time.
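As a rough illustrative calculation (the numbers are assumptions, not measurements from this cluster): if the default queue has 24 GB of memory available and the limit is 0.1, only about 2.4 GB can be spent on ApplicationMasters; with AM containers of 1-2 GB each, only one or two applications can be active at once, and everything else stays in the ACCEPTED state. Raising the limit to 0.5 leaves about 12 GB for AMs, so several applications can run concurrently.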

(1) Modify the following parameter value in the /opt/module/hadoop-3.1.3/etc/hadoop/capacity-scheduler.xml file on hadoop102.

[root@hadoop102 hadoop]$ vim capacity-scheduler.xml

<property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.5</value>
    <description>
      Upper limit on the proportion of cluster resources that may be used to run
      ApplicationMasters; this parameter is usually used to limit the number of
      active applications. It is a float and defaults to 0.1, i.e. 10%. The limit
      for all queues is set with yarn.scheduler.capacity.maximum-am-resource-percent,
      while a single queue can set its own value with
      yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent.
    </description>
</property>

(2) Distribute the capacity-scheduler.xml configuration file

[root@hadoop102 hadoop]$ xsync capacity-scheduler.xml

(3) Stop the running tasks and restart the Yarn cluster.

[root@hadoop103 hadoop-3.1.3]$ sbin/stop-yarn.sh

[root@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh
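If only capacity-scheduler.xml was changed, a full restart is not strictly necessary; the scheduler configuration can usually be reloaded on the fly:

[root@hadoop103 hadoop-3.1.3]$ bin/yarn rmadmin -refreshQueues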

3 Add queues to the Yarn Capacity Scheduler

Solution 2: Creating multiple queues can also increase the concurrency of the capacity scheduler.

How multiple queues are typically organized in production:

By compute engine: create hive, spark, and flink queues.

By business line: create queues for ordering, payment, likes, comments, and favorites (related to users, activities, and discounts).

What are the benefits?

Isolation (decoupling): if a newcomer to the company writes a runaway recursive job, it can exhaust the cluster's resources and paralyze the whole big data platform; with separate queues, the damage is confined to one queue.

Support for degraded operation: during a peak such as the 11.11 shopping festival, the data volume is huge and there are many tasks; if all of them run at once, none of them can finish in time. In that case lower-priority queues can be degraded while the core business keeps running, for example:

Order √

Payment √

Like X

3.1 Add a Capacity Scheduler queue

(1) Modify the configuration file of the capacity scheduler

In the default Yarn configuration, the Capacity Scheduler has only one queue, default. You can configure multiple queues in capacity-scheduler.xml; modify the following properties to add a hive queue.

<property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default,hive</value>
    <description>
      Add another queue named hive.
    </description>
</property>

<property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>50</value>
    <description>
      The capacity of the default queue is 50%.
    </description>
</property>

Also add the necessary properties for the newly added queue:

<property>
    <name>yarn.scheduler.capacity.root.hive.capacity</name>
    <value>50</value>
    <description>
      The capacity of the hive queue is 50%.
    </description>
</property>

<property>
    <name>yarn.scheduler.capacity.root.hive.user-limit-factor</name>
    <value>1</value>
    <description>
      The maximum proportion of the queue's resources that a single user can
      acquire; 1 means one user can use at most the queue's configured capacity.
    </description>
</property>

<property>
    <name>yarn.scheduler.capacity.root.hive.maximum-capacity</name>
    <value>80</value>
    <description>
      The maximum capacity of the hive queue (the upper limit it can grow to by
      borrowing resources from other queues when its own are insufficient).
    </description>
</property>

<property>
    <name>yarn.scheduler.capacity.root.hive.state</name>
    <value>RUNNING</value>
    <description>
      Set the hive queue state to RUNNING; a queue whose state is not RUNNING
      cannot accept applications.
    </description>
</property>

<property>
    <name>yarn.scheduler.capacity.root.hive.acl_submit_applications</name>
    <value>*</value>
    <description>
      Access control: who may submit applications to this queue; * means anyone.
    </description>
</property>

<property>
    <name>yarn.scheduler.capacity.root.hive.acl_administer_queue</name>
    <value>*</value>
    <description>
      Access control: who may administer this queue's applications (including
      submitting and killing them); * means anyone.
    </description>
</property>

<property>
    <name>yarn.scheduler.capacity.root.hive.acl_application_max_priority</name>
    <value>*</value>
    <description>
      Specifies which users may configure application priorities when submitting
      jobs; * means anyone.
    </description>
</property>

<property>
    <name>yarn.scheduler.capacity.root.hive.maximum-application-lifetime</name>
    <value>-1</value>
    <description>
      The maximum lifetime, in seconds, of an application in the hive queue.
      Any value less than or equal to zero is treated as disabled.
    </description>
</property>

<property>
    <name>yarn.scheduler.capacity.root.hive.default-application-lifetime</name>
    <value>-1</value>
    <description>
      The default lifetime, in seconds, of an application in the hive queue.
      Any value less than or equal to zero is treated as disabled.
    </description>
</property>

(2) Distribute configuration files

[root@hadoop102 ~]$ xsync /opt/module/hadoop-3.1.3/etc/hadoop/capacity-scheduler.xml

(3) Restart the Hadoop cluster
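After the restart you can verify from the command line that both queues exist and are in the RUNNING state, for example:

[root@hadoop102 ~]$ mapred queue -list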

3.2 Test the new queue

(1) Submit an MR task and specify the queue as hive

[root@hadoop102 ~]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar pi -Dmapreduce.job.queuename=hive 1 1

(2) View the ResourceManager web page and observe the queue to which the task is submitted
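Other engines can target the hive queue in the same way. A minimal sketch for Spark, assuming Spark is installed with SPARK_HOME set; the SparkPi class and examples jar path come from the standard Spark distribution and are used here only for illustration:

[root@hadoop102 ~]$ spark-submit --master yarn --queue hive --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_*.jar 10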


Origin blog.csdn.net/godlovedaniel/article/details/108583602