YARN framework

YARN framework description:

YARN (Yet Another Resource Negotiator) is Hadoop's new resource manager. It is a general-purpose resource management system that provides unified resource management and scheduling for the applications running above it. Its introduction brought great benefits to the cluster in resource utilization, unified management, and data sharing.

YARN concepts:

The basic idea of YARN is to split the two main functions of the JobTracker (resource management and job scheduling/monitoring) into separate components: a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application here is either a conventional MapReduce job or a DAG (directed acyclic graph) of jobs.

The ResourceManager sits at the top of the YARN hierarchy. This entity controls the entire cluster and manages the allocation of the cluster's base computing resources to applications. The ResourceManager parcels out individual resources (compute, memory, bandwidth, etc.) to the NodeManagers (YARN's per-node agents). The ResourceManager also allocates resources together with the ApplicationMasters, and works with the NodeManagers to start and monitor their underlying applications. In this architecture, the ApplicationMaster takes over part of the former role of the TaskTracker, and the ResourceManager takes over the role of the JobTracker.

An ApplicationMaster manages each instance of an application running in YARN. The ApplicationMaster is responsible for negotiating resources from the ResourceManager and, through the NodeManagers, for monitoring container execution and resource usage (the allocation of CPU, memory, and other resources). Note that while today's resources are the traditional kinds (CPU cores, memory), the future may bring new resource types (such as graphics processing units or special-purpose processing devices) suited to the task at hand. From YARN's point of view, the ApplicationMaster is user code, so it poses a potential security problem. YARN assumes that an ApplicationMaster may be buggy or even malicious, and therefore treats it as unprivileged code.

The NodeManager manages each node in a YARN cluster. It provides per-node services in the cluster, from supervising the lifetime of containers to tracking node health and monitoring resources. MRv1 managed Map and Reduce task execution through slots, whereas the NodeManager manages abstract containers, which represent per-node resources available to a specific application. YARN continues to use the HDFS layer: the NameNode provides the metadata service, and DataNodes provide replicated block storage distributed across the cluster.

To use a YARN cluster, we first need an application request from a client. The ResourceManager negotiates the necessary resources for a container and launches an ApplicationMaster to represent the submitted application. Using a resource-request protocol, the ApplicationMaster then negotiates resource containers on each node for the application's use. While the application executes, the ApplicationMaster monitors its containers until completion. When the application completes, the ApplicationMaster deregisters its containers with the ResourceManager, and the execution cycle is done.
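The client side of this cycle is exposed through the org.apache.hadoop.yarn.client.api.YarnClient API. Below is a minimal sketch of that first step only (requesting an application and submitting it); the application name and container size are illustrative assumptions, and a real submission would also fill in a ContainerLaunchContext with the ApplicationMaster's command line and jars:

```java
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClientSketch {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Step 1: ask the ResourceManager for a new application id
        YarnClientApplication app = yarnClient.createApplication();

        // Step 2: describe the container that will run the ApplicationMaster
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("demo-app"); // assumption: any name
        appContext.setResource(Resource.newInstance(1024 /* MB */, 1 /* vcores */));
        // a full client would also set a ContainerLaunchContext here

        // Step 3: submit; the RM negotiates a container and starts the AM
        yarnClient.submitApplication(appContext);
    }
}
```

This sketch needs a reachable YARN cluster and the Hadoop client jars on the classpath, so it is illustrative rather than directly runnable here.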

YARN task scheduling process:

Calling job.waitForCompletion() triggers the following built-in process:

  1. The client sends a job submission request to the ResourceManager (RM)
  2. The RM returns the job's resources, the jar submission path (staging-dir), and a jobID
  3. The client submits the jar to the staging-dir (an HDFS path)
  4. The client reports the submission result to the RM
  5. The RM adds the task to the job queue
  6. A NodeManager (NM) node picks up the task
  7. The NodeManager allocates container resources to run it
  8. The RM has a NodeManager node launch MRAppMaster (the ApplicationMaster)
  9. The MRAppMaster running on that NodeManager node registers with the RM
  10. The MRAppMaster has some NodeManager nodes run the Map tasks
  11. The MRAppMaster has NodeManager nodes run the Reduce tasks
  12. When the tasks are complete, the MRAppMaster deregisters itself from the RM
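From user code, all twelve steps above are triggered by a single call. A minimal driver sketch follows; the class names WCRunner, WCMapper, and WCReducer are illustrative assumptions standing in for your own classes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WCRunner {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(WCRunner.class);          // identifies the jar to ship (steps 2-3)
        job.setMapperClass(WCMapper.class);         // assumed Mapper implementation
        job.setReducerClass(WCReducer.class);       // assumed Reducer implementation
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // steps 1-12 all happen inside this call
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```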

Several ways to run a job on YARN:

An MR program can be submitted and run in several modes.


Local mode

1/ On Windows, run the main method directly inside Eclipse; the job is submitted to the local executor LocalJobRunner
      ---- input and output data may be placed under a local path (C:/wc/srcdata/)
      ---- input and output data may also be placed in HDFS (hdfs://weekend110:9000/wc/srcdata)

2/ On Linux, run the main method directly inside Eclipse without adding any YARN-related configuration; the job is likewise submitted to LocalJobRunner
      ---- input and output data may be placed under a local path (/home/hadoop/wc/srcdata/)
      ---- input and output data may also be placed in HDFS (hdfs://weekend110:9000/wc/srcdata)
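Which executor is chosen is governed by configuration. A sketch of making the choice explicit in code (the host name weekend110 and port 9000 are taken from the examples above):

```java
Configuration conf = new Configuration();

// local mode: run in-process with LocalJobRunner, read the local filesystem
conf.set("mapreduce.framework.name", "local");
conf.set("fs.defaultFS", "file:///");

// ... or cluster mode: submit to YARN and read HDFS
// conf.set("mapreduce.framework.name", "yarn");
// conf.set("fs.defaultFS", "hdfs://weekend110:9000");
```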


Cluster mode

1/ Package the project as a jar, upload it to the server, then submit it with the hadoop command: hadoop jar wc.jar cn.itcast.hadoop.mr.wordcount.WCRunner
2/ Running the main method directly in Eclipse on Linux can also submit to the cluster, but the following measures are required:
      ---- add mapred-site.xml and yarn-site.xml to the project's src directory (their configuration items are loaded by default when the Job object is instantiated)
      ---- package the project as a jar (wc.jar), and add a configuration parameter conf.set("mapreduce.job.jar", "wc.jar") (the location of the jar package) when instantiating the Job object:

Configuration conf = new Configuration();
conf.set("mapreduce.job.jar","wc.jar");
Job job = Job.getInstance(conf);

3/ Running the main method directly in Eclipse on Windows can also submit to the cluster, but because the platforms are not compatible, many settings have to be changed:
        ---- keep an unpacked copy of the Hadoop installation on Windows
        ---- replace the files under its bin and lib directories with versions recompiled for your Windows version
        ---- then configure the HADOOP_HOME and PATH environment variables
        ---- modify the source of the YarnRunner class
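As an aside: on Hadoop 2.4 and later there is a cross-platform submission property that, in many cases, can stand in for patching YarnRunner by hand. A sketch, assuming such a version and the cluster host from the examples above:

```java
Configuration conf = new Configuration();
conf.set("mapreduce.framework.name", "yarn");
conf.set("yarn.resourcemanager.hostname", "weekend110"); // assumption: RM host from the examples above
// lets a Windows client build task command lines a Linux cluster can execute
conf.set("mapreduce.app-submission.cross-platform", "true");
conf.set("mapreduce.job.jar", "wc.jar");
```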

A standard way to write the job description and submission class (the second cluster-mode approach):

package com.sy.hadoop.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Standard pattern for a job description and submission class
public class FlowSumRunner extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {

        // use the configuration prepared by ToolRunner
        Configuration conf = getConf();
        conf.set("mapreduce.job.jar", "wc.jar");
        Job job = Job.getInstance(conf);

        // set the class that identifies the jar to run
        job.setJarByClass(FlowSumRunner.class);

        // set the Mapper class
        job.setMapperClass(FlowSumMapper.class);

        // set the Reducer class
        job.setReducerClass(FlowSumReducer.class);

        // set the map output key type
        job.setMapOutputKeyClass(Text.class);

        // set the map output value type
        job.setMapOutputValueClass(FlowBean.class);

        // If the map output key/value types are the same as the reduce output
        // types, the two map output settings above can be omitted; if they
        // differ, the map output types must be configured separately as above.
        // set the reduce output key type
        job.setOutputKeyClass(Text.class);

        // set the reduce output value type
        job.setOutputValueClass(FlowBean.class);

        // input directory (files read by the map phase)
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // output directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // launch the job and wait for it to finish
        return job.waitForCompletion(true) ? 0 : 1;
    }


    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new FlowSumRunner(), args);
        System.exit(res);
    }

}


      
      
      

Origin blog.csdn.net/qq_40325734/article/details/89057718