Introduction to Hadoop: a brief look at how the Hadoop 2.x YARN components collaborate

System Structure

Hadoop 2.x consists of four core parts:

(1) HDFS: the distributed storage component

HDFS is Hadoop's basic component for storing data. It is distributed: the nodes of an HDFS cluster interact across the network to present the data as one logical file system.
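For instance, client code reads an HDFS file through a single logical namespace even though the blocks are spread across many nodes. Here is a minimal sketch using the HDFS Java API; the cluster address and file path are placeholder assumptions:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        // FileSystem hides the distributed details: the blocks may live on
        // many nodes, but the client sees one logical file.
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/input.txt"))))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```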

(2) YARN: the resource management and task scheduling component

YARN is Hadoop's basic component for resource management and task scheduling. YARN turns Hadoop into a general platform for distributed data processing and supports a variety of computing frameworks, such as MapReduce v2, Tez, and Hoya.

(3) Processing frameworks: distributed computing frameworks

There are many computing frameworks for different computing models, such as MapReduce v2 for batch processing, Giraph for graph processing, and Storm for stream processing.

(4) APIs: application programming interfaces

Parallel-computing programming interfaces through which users and applications interact with Hadoop.
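The most familiar of these interfaces is the MapReduce Job API. Below is the canonical WordCount example, lightly condensed from the standard Hadoop tutorial; input and output paths come from the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));  // emit (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```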


YARN Components

(1) Resource Manager

The Resource Manager is the core component of YARN; it manages all of the data-processing resources in a Hadoop cluster. Its job is to maintain a global view of the cluster's resources, handle resource requests, schedule those requests, and allocate resources to the requesting applications. The Resource Manager is essentially a dedicated scheduler that hands resources to requesting applications, but it delegates the actual scheduling logic to a pluggable scheduler module.

The Resource Manager is agnostic to applications and computing frameworks. It has no concept of map or reduce tasks, does not track the progress of jobs or tasks, and does not handle task failures. Its only job is to schedule workloads. This strict separation of duties makes YARN easier to scale, lets it serve as a more general Hadoop platform for applications, and enables YARN to support multi-tenant Hadoop clusters.
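To make this concrete, here is a minimal sketch of a client asking the Resource Manager to launch an application through the YARN client API. The application name, queue, resource amounts, and launch command are placeholder assumptions; note that nothing in the submission tells the Resource Manager what the application actually computes:

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitToResourceManager {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the Resource Manager for a new application id.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app");            // placeholder name

        // Describe how to launch the Application Master container.
        ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                Collections.emptyMap(),                // local resources
                Collections.emptyMap(),                // environment
                Collections.singletonList("sleep 60"), // placeholder command
                null, null, null);
        ctx.setAMContainerSpec(amContainer);

        // Resources the AM container needs: 512 MB of memory, 1 vcore.
        ctx.setResource(Resource.newInstance(512, 1));
        ctx.setQueue("default");

        // The Resource Manager schedules this request; it knows nothing
        // about what the application computes.
        ApplicationId appId = yarnClient.submitApplication(ctx);
        System.out.println("Submitted " + appId);
    }
}
```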

(2) Node Manager

Each slave node runs a Node Manager daemon, which acts as the slave of the Resource Manager. Each slave node also hosts the processing and storage services that make Hadoop a distributed system. Each Node Manager tracks the data-processing resources available on its node and sends regular reports to the Resource Manager.

Processing resources in a Hadoop cluster are consumed in the form of containers. A container is the collection of resources necessary to run an application, including CPU cores, memory, network bandwidth, and disk space. A deployed container runs as an independent process on a node in the Hadoop cluster. All container processes running on a slave node are provisioned, monitored, and tracked by that node's Node Manager daemon.

Tip: The container concept in Hadoop 2 resembles the slot concept in Hadoop 1, but with important differences: 1) a slot is defined only for running a map or reduce task, while a container is generic and can host any application logic; 2) a container can request a user-defined amount of resources, as long as the request stays within the cluster's configured allocation limits, whereas requesting a slot always allocates one complete, fixed-size slot.
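The difference is visible in the request API: a container ask names exact amounts of memory and vcores instead of a fixed slot size. A minimal sketch, where the amounts are illustrative assumptions:

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerAsk {
    // Unlike a fixed map/reduce slot, a container ask names exact amounts
    // of memory and vcores. The ask must lie between the cluster's
    // yarn.scheduler.minimum-allocation-mb/-vcores settings and the
    // corresponding maximum-allocation settings, or it is rejected.
    public static ContainerRequest makeAsk() {
        Resource capability = Resource.newInstance(2048, 2); // 2048 MB, 2 vcores
        Priority priority = Priority.newInstance(0);
        // null node/rack lists: let the scheduler place the container anywhere.
        return new ContainerRequest(capability, null, null, priority);
    }

    public static void main(String[] args) {
        System.out.println("Ask: " + makeAsk().getCapability());
    }
}
```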

(3) Application Master

Each application running in a Hadoop cluster has its own dedicated Application Master instance, which itself runs inside a container process on one of the nodes. Throughout its life cycle, the Application Master sends heartbeat messages to the Resource Manager, reporting its state and the application's resource needs. Based on its scheduling decisions, the Resource Manager grants the Application Master container resource leases on specific slave nodes, in effect reserving the resource containers the application needs.

The Application Master oversees the entire life cycle of an application, from requesting resource containers from the Resource Manager to presenting the granted container leases to the Node Manager.

Tip: Each computing framework must provide its own Application Master implementation. For example, MapReduce has a dedicated Application Master implementation for running map and reduce tasks.
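To make the heartbeat-and-lease cycle concrete, here is a minimal sketch of an Application Master's life cycle built on the AMRMClient API. The empty host/tracking URL, the single one-container ask, and the polling loop are simplifying assumptions:

```java
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AppMasterSketch {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(new YarnConfiguration());
        rm.start();

        // Register with the Resource Manager (placeholder host/port/URL).
        rm.registerApplicationMaster("", 0, "");

        // Ask for one worker container: 1 GB of memory, 1 vcore.
        rm.addContainerRequest(new ContainerRequest(
                Resource.newInstance(1024, 1), null, null, Priority.newInstance(0)));

        int granted = 0;
        while (granted < 1) {
            // allocate() doubles as the heartbeat: it reports progress and
            // picks up any container leases the scheduler has granted.
            AllocateResponse response = rm.allocate(0.1f);
            for (Container c : response.getAllocatedContainers()) {
                granted++;
                System.out.println("Got lease for container " + c.getId()
                        + " on " + c.getNodeId());
            }
            Thread.sleep(1000);
        }

        // ... launch work in the granted containers, wait for completion ...

        rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", "");
        rm.stop();
    }
}
```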


Running a YARN-based Application: How the YARN Components Collaborate

  • 1) The client submits an application request to the Resource Manager.
  • 2) The Resource Manager asks a Node Manager to create an Application Master instance for the application.
  • 3) The Node Manager obtains an available container and starts the container process.
  • 4) The Application Master initializes inside the container process and registers itself with the Resource Manager.
  • 5) The Application Master asks the NameNode for the names, locations, and data blocks of the files the application needs to process, and calculates how many map and reduce tasks are required to process those blocks.
  • 6) The Application Master sends heartbeat messages (carrying a list of requested resources and any state changes) to the Resource Manager to request the resources the application needs to run.
  • 7) The Resource Manager accepts the resource request and places it in the scheduling queue. When the requested resources become available on a slave node, the Resource Manager grants the Application Master a lease on the container resources.
  • 8) The Application Master sends a CLC (container launch context, containing everything the application task needs: environment variables, authorization tokens, runtime local resources, and the command line that starts the actual process) to the Node Manager to claim the container assigned by the Resource Manager. The Node Manager then creates the container process and starts it (a minimal sketch follows this list).
  • 9) Once the container process starts, the application begins executing, with the Application Master overseeing its progress.
  • 10) When all of the application's tasks are complete, the Application Master sends the result set to the client, notifies the Resource Manager that the application has finished, deregisters from the Resource Manager, and shuts down its own instance.
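As referenced in step 8, here is a minimal sketch of building a CLC and handing it to a Node Manager through the NMClient API. The environment, the command line, and the `container` lease (assumed to have been obtained from an earlier allocate() heartbeat, as in the Application Master sketch above) are placeholder assumptions:

```java
import java.util.Collections;
import java.util.Map;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class LaunchContainer {
    // 'container' is a lease previously granted by the Resource Manager
    // through AMRMClient.allocate().
    static void launch(NMClient nmClient, Container container) throws Exception {
        Map<String, String> env = Collections.singletonMap("APP_HOME", "/opt/app");

        // The CLC bundles everything the task process needs: environment
        // variables, local resources, security tokens, and the launch command.
        ContainerLaunchContext clc = ContainerLaunchContext.newInstance(
                Collections.emptyMap(),                      // local resources
                env,                                         // environment
                Collections.singletonList(                   // placeholder command
                        "sh -c 'echo hello from " + container.getId() + "'"),
                null, null, null);

        // The Node Manager validates the lease, creates the container
        // process, and starts it.
        nmClient.startContainer(container, clc);
    }

    public static void main(String[] args) throws Exception {
        NMClient nmClient = NMClient.createNMClient();
        nmClient.init(new YarnConfiguration());
        nmClient.start();
        // launch(nmClient, container);  // container comes from allocate()
    }
}
```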
