The next generation of the Hadoop MapReduce framework: faster and stronger

Hadoop, the open-source framework for distributed file storage and processing, is familiar to anyone working on big data storage and distributed processing systems in industry, so it needs no further introduction here. As requirements evolved, the Yarn framework emerged. By comparing the old and new Hadoop MapReduce frameworks, this article aims to give readers a deeper understanding of the technical principles and design of the new Yarn framework.

Background

Yarn is a distributed resource management system designed to improve resource utilization in a distributed cluster environment; the resources it manages include memory, I/O, network, disk and so on. It was created to address the shortcomings of the original MapReduce framework. The original MapReduce committers had been making periodic changes to the existing code, but as the code grew and because of limitations in the original MapReduce design, modifications within the original framework became increasingly difficult. The MapReduce committers therefore decided to re-architect MapReduce. The next-generation MapReduce framework (MRv2 / Yarn) offers better scalability, availability, reliability and backward compatibility, higher resource utilization, and support for computational frameworks other than MapReduce.

Shortcomings of the original MapReduce framework

JobTracker is the single point through which all cluster operations flow, so it is a single point of failure.

JobTracker has too much to do: it must maintain both the status of each job and the status of every task within each job, which leads to excessive resource consumption.

On the TaskTracker side, representing resources simply as the number of map/reduce tasks is too coarse. It does not take CPU, memory and other resources into account, so when two tasks that consume a lot of memory are scheduled onto the same node, OOM can easily occur.

Resources are forcibly divided into map slots and reduce slots; when there are only map tasks the reduce slots cannot be used, and when there are only reduce tasks the map slots cannot be used, which easily leads to a shortage of resources.
Yarn architecture

The basic idea of Yarn / MRv2 is to split the two major functions of the original JobTracker, resource management and job scheduling/monitoring, into two separate daemons. There is a global ResourceManager (RM) and one ApplicationMaster (AM) per application, where an application corresponds to a single map-reduce job or a DAG of jobs. The ResourceManager and the NodeManagers (NM, one per compute node) form the basic data-computation framework. The ResourceManager coordinates the use of resources across the cluster; any client or running ApplicationMaster that wants to run a job or task must first request a certain amount of resources from the RM. The ApplicationMaster is a framework-specific library: the MapReduce framework implements its own AM, and users can implement their own as well. At run time, the AM works together with the NMs to start and monitor tasks.

ResourceManager

As the coordinator of resources, the ResourceManager has two main components: the Scheduler and the ApplicationsManager (AsM).

The Scheduler is responsible for allocating the minimum amount of resources an application needs in order to run. The Scheduler does pure scheduling based on resource usage; it is not responsible for monitoring or tracking application state, and it does not handle failed tasks. The RM uses the concept of a resource container to manage cluster resources. A resource container is an abstraction of resources: each container comprises a certain amount of memory, I/O, network and other resources, although the current implementation only includes memory.
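To make the container abstraction concrete, here is a minimal sketch in Java of how an application describes the capability of one requested container, written against the Yarn client API of Hadoop 2.x; the memory and vcore values are placeholders.

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    public class ContainerSpecSketch {
        public static ContainerRequest buildRequest() {
            // A container is described only by its capability: an abstract bundle of
            // resources. Early Yarn releases enforce memory only; vcores came later.
            Resource capability = Resource.newInstance(1024 /* MB */, 1 /* vcores */);
            Priority priority = Priority.newInstance(0);
            // null node and rack lists mean the container may be placed anywhere.
            return new ContainerRequest(capability, null, null, priority);
        }
    }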

The ApplicationsManager is responsible for handling job submissions from clients, for negotiating the first container in which the ApplicationMaster runs, and for restarting the ApplicationMaster when it fails. The following describes some of the specific functions the RM performs.

1. Resource scheduling: after receiving resource requests from all running applications, the Scheduler builds a global resource allocation plan and then allocates resources according to application-specific constraints as well as some global constraints.

2. Resource monitoring: the Scheduler periodically receives resource usage monitoring information from the NMs; in addition, an ApplicationMaster can obtain from the Scheduler the status of its containers that have already completed.

3. Application submission:

The client obtains an applicationID from the AsM.

The client uploads the application definition and the required jar packages and files to a directory on HDFS; the directory is specified by yarn.app.mapreduce.am.staging-dir in yarn-site.xml.

The client submits an application context, containing the resource request and the launch configuration, to the AsM.

The AsM accepts the submitted application context.

The AsM negotiates with the Scheduler for a container in which to run the ApplicationMaster, and then starts the ApplicationMaster.

The AsM sends launchContainer information to the NM that owns the container, i.e. it starts the ApplicationMaster, and the AsM provides the client with status information about the running AM. A client-side sketch of these submission steps follows.
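Below is a minimal client-side sketch of the submission steps above, written against the YarnClient API of Hadoop 2.x; the application name is hypothetical and the upload of the jar to the staging directory is only indicated by a comment.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.util.Records;

    public class SubmitSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(conf);
            yarnClient.start();

            // Step 1: ask the AsM for a new application, which carries the applicationID.
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationId appId = app.getNewApplicationResponse().getApplicationId();

            // Step 2 would upload the AM jar and job files to the HDFS staging
            // directory (yarn.app.mapreduce.am.staging-dir) -- omitted here.

            // Step 3: build the application context: name, AM container spec, resources.
            ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
            ctx.setApplicationName("demo-app");                       // hypothetical name
            ContainerLaunchContext amContainer =
                    Records.newRecord(ContainerLaunchContext.class);  // env, commands, local resources
            ctx.setAMContainerSpec(amContainer);
            ctx.setResource(Resource.newInstance(1024, 1));           // memory/vcores for the AM container

            // Steps 4-6 happen inside the RM: the AsM accepts the context, asks the
            // Scheduler for a container, and tells the owning NM to launch the AM.
            yarnClient.submitApplication(ctx);
            System.out.println("Submitted " + appId);
        }
    }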

4. AM life cycle: the AsM is responsible for managing the life cycle of every AM in the system. The AsM starts the AM; once it is up, the AM periodically sends heartbeats to the AsM (every 1 s by default), which is how the AsM knows the AM is alive, and the AsM is responsible for restarting the AM when it fails. If no heartbeat is received from the AM within a certain period (10 min by default), the AsM considers the AM to have failed.
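Both intervals mentioned above are configurable. The sketch below sets them programmatically, assuming the standard MRv2/Yarn property names for the AM heartbeat and the liveness expiry; the values simply restate the defaults.

    import org.apache.hadoop.conf.Configuration;

    public class AmLivenessSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // MR AM -> RM heartbeat interval (the "1 s" above); property name assumed
            // from the standard MRv2 configuration.
            conf.setLong("yarn.app.mapreduce.am.scheduler.heartbeat.interval-ms", 1000L);
            // How long the RM waits without a heartbeat before declaring the AM
            // failed (the "10 min" above).
            conf.setLong("yarn.am.liveness-monitor.expiry-interval-ms", 600000L);
        }
    }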

High availability of the ResourceManager has not yet been implemented well, but Cloudera's upcoming CDH4.4 release implements a simple form of high availability. It reuses the HA code from the hadoop-common project and adopts a design similar to the HDFS NameNode HA: active and standby states are introduced for the RM, but there is no role corresponding to the JournalNode; instead, ZooKeeper is in charge of maintaining the RM state. This design is only the simplest possible solution for avoiding a manual restart of the RM, and it is still some distance from being usable in real production.

NodeManager

The NM is mainly responsible for starting the containers that the RM assigns to the AM and the containers that the AM requests for its tasks, and for monitoring the running containers. When starting a container, the NM sets the necessary environment variables and downloads the jar packages and files the container needs from HDFS to the local machine, which is called resource localization; when all the preparation work is done, it runs the script that represents the container to start it. Once the container is running, the NM periodically monitors the resources it occupies; if the container exceeds the amount of resources it declared, the NM kills the process the container represents.
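From the client or AM side, resource localization is driven by LocalResource records. The sketch below, written against the Hadoop 2.x records API, describes a jar that already sits on HDFS so that the NM will download it into the container's working directory before launch; the path is whatever the caller passes in.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.yarn.api.records.LocalResource;
    import org.apache.hadoop.yarn.api.records.LocalResourceType;
    import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
    import org.apache.hadoop.yarn.util.ConverterUtils;

    public class LocalizationSketch {
        // Describe a jar already sitting on HDFS so the NM will download ("localize")
        // it into the container's working directory before launch.
        public static LocalResource jarOnHdfs(Configuration conf, Path hdfsJar) throws IOException {
            FileStatus status = FileSystem.get(conf).getFileStatus(hdfsJar);
            return LocalResource.newInstance(
                    ConverterUtils.getYarnUrlFromPath(hdfsJar),  // where the NM fetches it from
                    LocalResourceType.FILE,
                    LocalResourceVisibility.APPLICATION,
                    status.getLen(),
                    status.getModificationTime());
        }
    }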

In addition, the NM provides a simple service for managing the local directories on its machine. An application can keep accessing its local directories on that machine even if it no longer has any container running there. For example, Map-Reduce applications use this service to store map outputs and shuffle them to the corresponding reduce tasks.

The services on the NM can also be extended. Yarn provides the configuration item yarn.nodemanager.aux-services, through which users can define custom services; the Map-Reduce shuffle, for example, is implemented in exactly this way.
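As an example, the MapReduce shuffle is registered with the NM through two properties that normally live in yarn-site.xml; the sketch below sets the same properties programmatically (the service name is mapreduce_shuffle in recent 2.x releases, while very early releases used mapreduce.shuffle).

    import org.apache.hadoop.conf.Configuration;

    public class AuxServiceConfigSketch {
        public static void main(String[] args) {
            // Equivalent to the yarn-site.xml entries that register the MapReduce
            // shuffle as a NodeManager auxiliary service.
            Configuration conf = new Configuration();
            conf.set("yarn.nodemanager.aux-services", "mapreduce_shuffle");
            conf.set("yarn.nodemanager.aux-services.mapreduce_shuffle.class",
                     "org.apache.hadoop.mapred.ShuffleHandler");
        }
    }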

For each application running on the node, the NM creates a per-application directory tree in its local directories, and inside it a separate working directory for each of the application's containers.
When starting a container, the NM executes the container's default_container_executor.sh, which in turn executes launch_container.sh. launch_container.sh first sets some environment variables and finally executes the program's start command. For MapReduce, starting the AM executes org.apache.hadoop.mapreduce.v2.app.MRAppMaster, and starting a map/reduce task executes org.apache.hadoop.mapred.YarnChild.
ApplicationMaster

The ApplicationMaster is a framework-specific library. The Map-Reduce computing model has its own ApplicationMaster implementation, and any other computing model that wants to run on Yarn must implement its own ApplicationMaster to request resources from the RM and run the application's tasks; running Spark on Yarn, for example, also relies on a corresponding ApplicationMaster implementation. In the final analysis, Yarn is a resource management framework, not a computational framework; to run an application on Yarn, a concrete computing framework is still required. Since Yarn appeared together with MRv2, the following is a brief overview of how MRv2 runs on Yarn.

The MRv2 running process:

The MR JobClient submits a job to the ResourceManager (AsM).

The AsM asks the Scheduler for a container in which to run the MR AM, and then starts it.

After the MR AM starts up, it registers with the AsM.

The MR JobClient obtains information about the MR AM from the AsM and then communicates with the MR AM directly.

The MR AM computes the input splits and prepares resource requests for all of the map tasks.

The MR AM does the preparatory work required by the MR OutputCommitter.

The MR AM sends resource requests to the RM (Scheduler), obtains a group of containers for the map/reduce tasks to run in, and then works with the NM of each container to do the necessary work, including resource localization (a sketch of this allocate loop follows the list).

The MR AM monitors the tasks until they complete; when a task fails, it requests a new container in which to rerun the failed task.

As each map/reduce task completes, the MR AM runs the cleanup code of the MR OutputCommitter, i.e. some finishing work.

When all map/reduce tasks have completed, the MR AM runs the OutputCommitter's job commit or abort APIs as appropriate.

The MR AM exits.
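The allocate loop referenced in the resource-request step looks roughly like the sketch below when written against the AMRMClient API of Hadoop 2.x; the container count and sizes are placeholders, and a real AM would hand every granted container to an NMClient to launch a task.

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    public class AmAllocateSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
            rmClient.init(conf);
            rmClient.start();

            // Register with the RM (the AM registration step above).
            rmClient.registerApplicationMaster("", 0, "");

            // Ask for a handful of 1 GB containers for the map tasks.
            Resource capability = Resource.newInstance(1024, 1);
            for (int i = 0; i < 4; i++) {
                rmClient.addContainerRequest(
                        new ContainerRequest(capability, null, null, Priority.newInstance(0)));
            }

            // Heartbeat / allocate loop: each call doubles as the AM heartbeat and
            // returns whatever containers the Scheduler has granted so far.
            int granted = 0;
            while (granted < 4) {
                AllocateResponse response = rmClient.allocate(0.1f);
                List<Container> containers = response.getAllocatedContainers();
                granted += containers.size();
                // Here a real AM would hand each container to an NMClient to launch a task.
                Thread.sleep(1000);
            }

            rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
        }
    }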

Writing applications on Yarn

Writing an application that runs on Yarn is different from writing the MapReduce applications we are familiar with. Keep in mind that Yarn is only a resource management framework, not a computational framework; computational frameworks run on top of it. All it lets us do is request containers from the RM and then start those containers together with the NMs. Just as in MRv2, the JobClient requests a container for the MR AM to run in, sets the environment variables and the start command, and then hands them over to the NM to start the MR AM; from then on the map/reduce tasks are the sole responsibility of the MR AM, and each task is of course started in the same way, with the MR AM requesting a container from the RM and then starting it together with the NM. So, to run a program on Yarn outside any existing computational framework, we have to implement our own client and ApplicationMaster. In addition, our custom AM's classes need to be placed on the classpath of every NM, because the AM may run on any machine where an NM is located.
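A minimal sketch of that hand-off, using the NMClient API of Hadoop 2.x: it builds a ContainerLaunchContext with an environment variable and a start command and asks the NM that owns the allocated container to launch it. The environment variable and the main class are hypothetical, and <LOG_DIR> is the placeholder Yarn expands to the container's log directory.

    import java.util.Collections;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.client.api.NMClient;
    import org.apache.hadoop.yarn.util.Records;

    public class LaunchSketch {
        // Ask the NM that owns an allocated container to start our process in it.
        public static void launch(Configuration conf, Container container) throws Exception {
            NMClient nmClient = NMClient.createNMClient();
            nmClient.init(conf);
            nmClient.start();

            ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
            // Environment variables visible to the launched process.
            ctx.setEnvironment(Collections.singletonMap("APP_MODE", "demo"));  // hypothetical variable
            // The start command; stdout/stderr go to the container's log directory.
            ctx.setCommands(Collections.singletonList(
                    "java -Xmx512m com.example.MyAppMaster"                    // hypothetical main class
                    + " 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr"));

            nmClient.startContainer(container, ctx);
        }
    }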


Origin: blog.csdn.net/qq_41753040/article/details/90633737