How to enable and configure YARN

Apache Hadoop YARN (Yet Another Resource Negotiator) is Hadoop's cluster resource manager: it allocates cluster resources to applications and schedules their execution. This article describes how to enable and configure YARN.

1. YARN configuration file

Before you start configuring YARN, prepare the configuration files it reads and adapt them to your cluster environment. In a standard Hadoop installation they typically live in the Hadoop configuration directory (for example $HADOOP_HOME/etc/hadoop). The most important files are:

  • yarn-site.xml: This is the main configuration file of YARN. It contains the settings for each YARN component, such as the ResourceManager, the NodeManager, and the history server.
<configuration>
  <!-- ResourceManager configuration -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>your_resourcemanager_hostname</value>
    <description>Host name of the node that runs the ResourceManager</description>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>your_resourcemanager_address:port</value>
    <description>Address that clients use to connect to the ResourceManager</description>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>your_resourcemanager_address:port</value>
    <description>Address of the ResourceManager's scheduler interface</description>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>your_resourcemanager_webapp_address:port</value>
    <description>Address of the ResourceManager's web application</description>
  </property>

  <!-- NodeManager configuration -->
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/path/to/local/dirs</value>
    <description>Local directories in which the NodeManager stores its runtime data and temporary files</description>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/path/to/log/dirs</value>
    <description>Directories in which log files are stored</description>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
    <description>Names of the auxiliary services, such as the MapReduce shuffle service</description>
  </property>
  <property>
    <!-- The key must contain the service name exactly as listed in yarn.nodemanager.aux-services -->
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    <description>Implementation class of the MapReduce shuffle service</description>
  </property>

  <!-- History server configuration -->
  <property>
    <name>yarn.log.server.url</name>
    <value>http://your_historyserver_address:port</value>
    <description>URL of the history server</description>
  </property>

  <!-- Other configuration items -->
</configuration>

  • capacity-scheduler.xml: This is the configuration file for YARN's CapacityScheduler. It defines queues and their resource quotas, which the scheduler uses to divide cluster resources among jobs.
<configuration>
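  <property>
    <!-- queue_name is a placeholder; a literal <queue_name> would not be well-formed XML here -->
    <name>yarn.scheduler.capacity.root.queue_name.capacity</name>
    <value>50</value>
    <description>Share of resources initially assigned to the queue, in percent</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.queue_name.maximum-capacity</name>
    <value>100</value>
    <description>Maximum share of resources the queue may use, in percent</description>
  </property>
  <!-- Settings for other queues and resource quotas -->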
</configuration>

  • mapred-site.xml: This is the configuration file for MapReduce. MapReduce jobs that run on YARN read it, so MapReduce-related settings such as the JobHistoryServer addresses and task memory sizes go here.
<configuration>
  <!-- JobHistoryServer configuration -->
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>your_jobhistoryserver_address:port</value>
    <description>Connection address of the JobHistoryServer</description>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>your_jobhistoryserver_webapp_address:port</value>
    <description>Address of the JobHistoryServer's web application</description>
  </property>

  <!-- MapReduce task configuration -->
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>512</value>
    <description>Size of the buffer used to sort the intermediate results of a MapReduce task (in MB)</description>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>2048</value>
    <description>Memory limit for each map task (in MB)</description>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>4096</value>
    <description>Memory limit for each reduce task (in MB)</description>
  </property>

  <!-- Other configuration items -->
</configuration>
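
The mapred-site.xml example above does not show the setting that makes MapReduce jobs run on YARN in the first place. As a minimal sketch, assuming a Hadoop 2.x or later release, this is usually done with mapreduce.framework.name. That property is an addition for illustration and is not part of the original example.

```xml
<configuration>
  <!-- Assumption: not in the example above. Submits MapReduce jobs
       to YARN instead of running them with the local job runner. -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```

Merge the property into the existing configuration element of mapred-site.xml rather than adding a second one.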

Note that queue_name and the /path/to directories in the examples are placeholders. Replace them, along with the your_... addresses and ports, with the actual values for your cluster.

2. YARN ResourceManager configuration

YARN’s ResourceManager (RM) is responsible for resource management and job scheduling across the entire cluster. It is configured in yarn-site.xml. The most important ResourceManager settings are:

  • yarn.resourcemanager.hostname: The host name of the node that runs the ResourceManager.
  • yarn.resourcemanager.address: The address that clients connect to, for example to submit applications.
  • yarn.resourcemanager.scheduler.address: The address of the scheduler interface, which ApplicationMasters use to request resources.
  • yarn.resourcemanager.webapp.address: The HTTP address of the ResourceManager's web application.

Make sure these items in the configuration file are set correctly to suit your cluster environment.
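
In many clusters only the host name has to be set explicitly. The sketch below assumes the stock defaults from yarn-default.xml, where the three addresses are derived from yarn.resourcemanager.hostname with ports 8032, 8030, and 8088. The host name rm.example.com is hypothetical.

```xml
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rm.example.com</value> <!-- hypothetical host name -->
  </property>

  <!-- The entries below only spell out the assumed default values;
       they can be omitted unless different ports are needed. -->
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>${yarn.resourcemanager.hostname}:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>${yarn.resourcemanager.hostname}:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>${yarn.resourcemanager.hostname}:8088</value>
  </property>
</configuration>
```

Keeping the addresses derived from a single host name means that moving the ResourceManager to another node requires changing one value only.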

3. YARN NodeManager configuration

YARN’s NodeManager (NM) is responsible for resource management and task execution on each worker node. It is also configured in yarn-site.xml. The most important NodeManager settings are:

  • yarn.nodemanager.local-dirs: The local directories in which the NodeManager stores its runtime data and temporary files.
  • yarn.nodemanager.log-dirs: The directories in which log files are stored.
  • yarn.nodemanager.aux-services: The names of the auxiliary services to run, such as the MapReduce shuffle service (mapreduce_shuffle).
  • yarn.nodemanager.aux-services.mapreduce_shuffle.class: The implementation class of the MapReduce shuffle service. The service name in the key must match the name listed in yarn.nodemanager.aux-services.

Make sure these configuration items are set correctly based on your cluster environment.
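
Both directory settings accept a comma-separated list, which lets the NodeManager spread its I/O across several disks. The following is a minimal sketch of that layout. The /data1 and /data2 mount points are hypothetical and should be replaced with the disks actually available on your nodes.

```xml
<configuration>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <!-- hypothetical paths, one directory per physical disk -->
    <value>/data1/yarn/local,/data2/yarn/local</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <!-- hypothetical paths -->
    <value>/data1/yarn/logs,/data2/yarn/logs</value>
  </property>
</configuration>
```

The directories must exist and be writable by the user that runs the NodeManager.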

4. YARN history server configuration

YARN’s history server is used to view the logs and statistics of jobs that have run. It is configured in yarn-site.xml. The most important history server setting is:

  • yarn.log.server.url: The URL of the history server, through which job logs can be viewed.

Set the correct history server URL based on your needs.
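
yarn.log.server.url is typically used together with log aggregation. The sketch below assumes that the MapReduce JobHistoryServer acts as the log server on its default web port 19888. The host name jhs.example.com is hypothetical, and yarn.log-aggregation-enable is an additional property that this article does not otherwise cover.

```xml
<configuration>
  <!-- Assumption: collect container logs from the nodes after an
       application finishes so the history server can serve them -->
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.log.server.url</name>
    <!-- hypothetical host; 19888 is the default JobHistoryServer web port -->
    <value>http://jhs.example.com:19888/jobhistory/logs</value>
  </property>
</configuration>
```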

5. Other configuration items

In addition to the settings described above, some other YARN configuration items may need to be adjusted to suit your needs. These include:

  • Queue configuration: If you need separate queues and resource quotas for different types of jobs, define them in the capacity-scheduler.xml configuration file.
  • Security authentication configuration: If your cluster requires security authentication, configure the related options, such as Kerberos and SSL.

Refer to the official Hadoop documentation and your cluster's requirements when setting these items.
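
As an illustration of the queue configuration mentioned above, the following capacity-scheduler.xml sketch splits the cluster between two queues. The queue name analytics and all percentages are hypothetical. With the CapacityScheduler, the capacities of the queues under the same parent are expected to add up to 100.

```xml
<configuration>
  <!-- Child queues of root; "analytics" is a hypothetical queue name -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default,analytics</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>60</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.analytics.capacity</name>
    <value>40</value>
  </property>
  <!-- Lets analytics borrow idle resources, up to 70% of the cluster -->
  <property>
    <name>yarn.scheduler.capacity.root.analytics.maximum-capacity</name>
    <value>70</value>
  </property>
</configuration>
```

A maximum-capacity above the queue's own capacity allows it to use resources that other queues leave idle, while the capacity value remains its guaranteed share.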

6. Start YARN

After completing the configuration above, start YARN and verify that the configuration has taken effect. Start the ResourceManager process and the NodeManager processes (in a standard installation, the sbin/start-yarn.sh script starts both), then check their log files for errors or exceptions.

You can also verify that YARN is running normally by opening the ResourceManager's web interface at the address set in yarn.resourcemanager.webapp.address (usually http://your_resourcemanager_address:port, where the default port is 8088).

Conclusion

By following the steps above, you can enable, configure, and start YARN so that it manages and allocates the resources of your cluster.



Origin blog.csdn.net/m0_53157282/article/details/133323969