Hadoop Series (Part 2) - The Cluster Resource Manager YARN

I. Introduction to Hadoop YARN

Apache YARN (Yet Another Resource Negotiator) is the cluster resource management system introduced in Hadoop 2.0. Users can deploy a variety of service frameworks on YARN, and YARN manages and allocates cluster resources for them in a unified way.

II. YARN Architecture

1. ResourceManager

The ResourceManager usually runs as a background process on a dedicated machine and is the main resource coordinator and administrator of the entire cluster. It is responsible for allocating resources to all applications submitted by users: based on information such as application priority, queue capacity, ACLs, and data location, it makes allocation decisions and then schedules cluster resources in a shared, secure, multi-tenant manner.

2. NodeManager

The NodeManager is YARN's agent on each node in the cluster. It is responsible for managing the life cycle of all containers on its node, monitoring resources, and tracking node health. Specifically, it:

  • Registers with the ResourceManager at startup, periodically sends heartbeat messages, and waits for instructions from the ResourceManager;
  • Maintains the life cycle of each Container and monitors Container resource usage;
  • Manages the dependencies of running tasks: before starting a Container, it copies the programs and dependencies required by the ApplicationMaster to the local node.

3. ApplicationMaster

When a user submits an application, YARN launches a lightweight process called the ApplicationMaster. The ApplicationMaster is responsible for negotiating resources with the ResourceManager and, working through the NodeManagers, monitoring the resource usage of its containers; it is also responsible for task monitoring and fault tolerance. Specifically, it:

  • Dynamically computes resource requirements based on the application's running state;
  • Requests resources from the ResourceManager and monitors how the application uses them;
  • Tracks task status and progress, and reports progress and resource-usage information;
  • Handles task fault tolerance.

4. Container

A Container is YARN's resource abstraction: it encapsulates the multi-dimensional resources of a node, such as memory, CPU, disk, and network. When the ApplicationMaster requests resources from the ResourceManager, the resources the ResourceManager returns are represented as Containers. YARN assigns each task a Container, and the task may only use the resources described by that Container. The ApplicationMaster can run any type of task inside a Container: for example, a MapReduce ApplicationMaster requests a container to launch a map or reduce task, while a Giraph ApplicationMaster requests a container to run a Giraph task.
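The container abstraction can be illustrated with a toy model (a minimal sketch, not Hadoop's actual API): a container request carries multi-dimensional resource demands, and a node can host it only if every dimension fits.

```python
# Toy model of YARN's Container resource abstraction (illustrative only).
from dataclasses import dataclass


@dataclass
class Resource:
    memory_mb: int
    vcores: int


@dataclass
class Node:
    total: Resource
    used_memory_mb: int = 0
    used_vcores: int = 0

    def can_allocate(self, req: Resource) -> bool:
        # Every resource dimension must fit for the container to be placed.
        return (self.used_memory_mb + req.memory_mb <= self.total.memory_mb
                and self.used_vcores + req.vcores <= self.total.vcores)

    def allocate(self, req: Resource) -> bool:
        if not self.can_allocate(req):
            return False
        self.used_memory_mb += req.memory_mb
        self.used_vcores += req.vcores
        return True


node = Node(total=Resource(memory_mb=8192, vcores=8))
print(node.allocate(Resource(memory_mb=4096, vcores=4)))  # True
print(node.allocate(Resource(memory_mb=4096, vcores=8)))  # False: not enough vcores left
```

The key point mirrored here is that allocation is all-or-nothing across dimensions: a request that fits in memory but not in vcores is rejected as a whole.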

III. How YARN Works (Overview)

  1. The client submits a job to YARN;

  2. The ResourceManager selects a NodeManager, which starts a Container and runs an ApplicationMaster instance in it;

  3. The ApplicationMaster requests additional Container resources from the ResourceManager as needed (if the job is small, the ApplicationMaster chooses to run the tasks in its own JVM);

  4. The ApplicationMaster runs the distributed computation in the Containers it has obtained.
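The four steps above can be sketched as a toy simulation (purely illustrative; the class and method names are assumptions for this sketch, not YARN APIs):

```python
# Toy walkthrough of the YARN submission flow (illustrative; not real YARN APIs).

class NodeManager:
    def start_container(self, process):
        # A container simply hosts a process and runs it.
        return process.run()


class ResourceManager:
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def submit_application(self, app):                  # step 1: client submits
        nm = self.node_managers[0]                      # step 2: pick a NodeManager,
        am = ApplicationMaster(app, self)               # which hosts the ApplicationMaster
        return nm.start_container(am)


class ApplicationMaster:
    def __init__(self, app, rm):
        self.app, self.rm = app, rm

    def run(self):
        # step 3: request one container per task from the ResourceManager
        nms = [self.rm.node_managers[i % len(self.rm.node_managers)]
               for i in range(self.app["tasks"])]
        # step 4: run the distributed computation in the obtained containers
        return [nm.start_container(Task(i)) for i, nm in enumerate(nms)]


class Task:
    def __init__(self, i):
        self.i = i

    def run(self):
        return f"task-{self.i} done"


rm = ResourceManager([NodeManager(), NodeManager()])
print(rm.submit_application({"tasks": 3}))  # ['task-0 done', 'task-1 done', 'task-2 done']
```

Note how the ResourceManager never runs tasks itself: it only places the ApplicationMaster, which then drives the rest of the job.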

IV. How YARN Works (In Detail)

1. Job Submission

The client calls the job.waitForCompletion method to submit a MapReduce job to the cluster (step 1). The ResourceManager assigns a new job ID (application ID) to the job (step 2). The client verifies the job's output specification, computes the input splits, and copies the job's resources (including the JAR package, configuration files, and split information) to HDFS (step 3). Finally, it submits the job by calling submitApplication() on the ResourceManager (step 4).
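The split computation in step 3 can be sketched in miniature, assuming the common case where the split size equals the HDFS block size (real MapReduce also honors min/max split-size settings and record boundaries):

```python
import math


def compute_splits(file_size, block_size=128 * 1024 * 1024):
    """Toy input-split calculation: one (offset, length) split per
    block-sized chunk of the input file. Illustrative only."""
    if file_size == 0:
        return []
    n = math.ceil(file_size / block_size)
    return [(i * block_size, min(block_size, file_size - i * block_size))
            for i in range(n)]


# A 300 MB file with 128 MB blocks yields 3 splits (128, 128, and 44 MB).
mb = 1024 * 1024
print(len(compute_splits(300 * mb)))  # 3
```

This split list is what the ApplicationMaster later reads back from HDFS to decide how many map tasks to create.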

2. Job initialization

When the ResourceManager receives the submitApplication() request, it hands the request to the scheduler. The scheduler allocates a container, and the ResourceManager then launches the ApplicationMaster process inside that container, supervised by the NodeManager (step 5).

The ApplicationMaster for a MapReduce job is a Java application whose main class is MRAppMaster. It monitors the job's progress by creating a number of bookkeeping objects that receive task progress and completion reports (step 6). It then retrieves the input splits computed by the client from the distributed file system (step 7), creates one map task for each input split, and creates reduce task objects according to mapreduce.job.reduces.
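The task-creation rule is simple enough to state in code (a minimal sketch, not MRAppMaster's implementation): one map task per split, and as many reduce tasks as mapreduce.job.reduces specifies.

```python
# Toy sketch of task creation during job initialization (illustrative only).

def create_tasks(splits, conf):
    """One map task per input split; reduce-task count comes from
    the mapreduce.job.reduces configuration property."""
    map_tasks = [{"type": "map", "split": s} for s in splits]
    num_reduces = int(conf.get("mapreduce.job.reduces", 1))
    reduce_tasks = [{"type": "reduce", "partition": i} for i in range(num_reduces)]
    return map_tasks, reduce_tasks


maps, reduces = create_tasks(["split-0", "split-1", "split-2"],
                             {"mapreduce.job.reduces": "2"})
print(len(maps), len(reduces))  # 3 2
```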

3. Task Assignment

If the job is small, the ApplicationMaster chooses to run the tasks in its own JVM.

Otherwise, the ApplicationMaster requests containers from the ResourceManager for all map and reduce tasks (step 8). These requests are piggybacked on heartbeat messages and include the data location of each map task, such as the hostnames and racks of the nodes storing its input split. The scheduler uses this information to make scheduling decisions: it tries to assign each task to a node that stores its data or, failing that, to a node on the same rack as a node storing the input split.
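The locality preference described above can be sketched as a simple heuristic: data-local first, then rack-local, then any free node (an illustrative model, not the actual scheduler code):

```python
def pick_node(split_hosts, rack_of, free_nodes):
    """Pick a node for a map task: prefer a free node that holds the
    split's data, then any free node on the same rack, then any free node."""
    data_local = [n for n in free_nodes if n in split_hosts]
    if data_local:
        return data_local[0], "data-local"
    split_racks = {rack_of[h] for h in split_hosts}
    rack_local = [n for n in free_nodes if rack_of[n] in split_racks]
    if rack_local:
        return rack_local[0], "rack-local"
    return free_nodes[0], "off-rack"


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
# The split lives on n1, which is busy; n2 shares its rack.
print(pick_node(["n1"], rack_of, ["n2", "n3"]))  # ('n2', 'rack-local')
```

The payoff of this ordering is bandwidth: a data-local task reads from local disk, a rack-local one stays within a single switch, and only the last resort crosses racks.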

4. Task Execution

Once the ResourceManager's scheduler has assigned a container to a task, the ApplicationMaster starts the container by contacting the NodeManager (step 9). The task is executed by a Java application whose main class is YarnChild. Before running the task, it first localizes the resources the task needs, such as the job configuration, JAR files, and all files in the distributed cache (step 10). Finally, it runs the map or reduce task (step 11).

YarnChild runs in a dedicated JVM; YARN does not support JVM reuse.

5. Progress and Status Updates

Tasks in YARN report their progress and status (including counters) back to the ApplicationMaster. The client polls the ApplicationMaster for progress updates every second (configured via mapreduce.client.progressmonitor.pollinterval) and displays them to the user.

6. Job Completion

In addition to polling the ApplicationMaster for job progress, the client checks every 5 seconds whether the job has completed by calling waitForCompletion(); the interval can be set via mapreduce.client.completion.pollinterval. When the job completes, the ApplicationMaster and containers clean up their working state, and the OutputCommitter's job cleanup method is called. The job's information is archived by the job history server for later inspection by the user.

V. Submitting a Job to YARN

Here we take the Pi-estimation MapReduce program from the Hadoop Examples as an illustration; the relevant JAR is under the share/hadoop/mapreduce directory of the Hadoop installation. The two arguments to pi are the number of map tasks and the number of samples per map:

# Submission format: hadoop jar <jar path> <main class> <main class arguments>
# hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.15.2.jar pi 3 3
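For intuition, the pi example estimates π by throwing points into a unit square and counting how many land inside the inscribed circle. A plain-Python sketch of the same idea (using pseudo-random sampling, whereas the Hadoop example uses a low-discrepancy sequence):

```python
import random


def estimate_pi(num_points, seed=42):
    """Monte Carlo pi estimate: the fraction of random points in the unit
    square that fall inside the inscribed quarter circle, times 4."""
    rng = random.Random(seed)
    inside = sum(1 for _ in range(num_points)
                 if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * inside / num_points


print(estimate_pi(100_000))  # close to 3.14 for large num_points
```

In the Hadoop version, each map task generates and classifies its share of the sample points, and a single reduce task aggregates the inside/outside counts to produce the final estimate.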

References

  1. 初步掌握 Yarn 的架构及原理 (A First Look at YARN's Architecture and Principles)

  2. Apache Hadoop 2.9.2 > Apache Hadoop YARN

More articles in this big data series can be found in the GitHub open-source project 大数据入门指南 (Big Data Getting Started Guide).


Original source: www.cnblogs.com/heibaiying/p/11306842.html