Hadoop study notes (IX): MapReduce 1.x

I. How MapReduce 1.0 computes over data

 

MapReduce is a computational model we often use for offline processing of large data sets. MapReduce encapsulates the computation process well, so we only need to write the Map and Reduce functions.


Input

Input is simply the storage location of the input files.

Be careful here: some blogs insist that it must be a location on a distributed file system such as HDFS. The default is indeed the HDFS file system, but this can be changed.

It may also be a file location on the local machine.
Let's analyze the Input stage carefully.


First of all, we know that dealing with the JobTracker goes through the JobClient interface.

The run method of the JobClient then tells the JobTracker all the information about the Hadoop job, such as the jar path, the mapper/reducer class names, the input file path, and so on, as shown in the following code:

public int run(String[] args) throws Exception {
        
        // Create the job
        Job job = Job.getInstance(getConf(), this.getClass().getSimpleName());
        
        // Set the jar that contains the job classes
        job.setJarByClass(this.getClass());
        
        // Set the input and output paths
        FileInputFormat.addInputPath(job, new Path(PropReader.Reader("arg1")));
        FileOutputFormat.setOutputPath(job, new Path(PropReader.Reader("arg2")));
        
        // Set the mapper and its output key/value types
        job.setMapperClass(HFile2TabMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        
        // Set the reducer
        job.setReducerClass(PutSortReducer.class);

        // Submit the job and wait for it to finish
        return job.waitForCompletion(true) ? 0 : 1;
    }

 

In addition, JobClient.runJob() does one more thing: it uses the InputFormat class to calculate how to divide the input into pieces for the mappers to process. InputFormat.getSplits() returns a List of InputSplit objects, and each InputSplit is the piece of data that one mapper needs to deal with.

 

A Hadoop job's input can be a single big file or many files; either way, getSplits() calculates how to split the input.
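As a rough sketch (based on the usual behavior of FileInputFormat, not code from this post), the default split size is derived from the HDFS block size and the configured minimum/maximum split sizes, roughly like this:

// Simplified sketch of how FileInputFormat sizes splits; it mirrors the idea
// behind computeSplitSize() in org.apache.hadoop.mapreduce.lib.input.FileInputFormat.
long computeSplitSize(long blockSize, long minSize, long maxSize) {
    // With the defaults, minSize <= blockSize <= maxSize, so the split size
    // equals the HDFS block size and there is one split per block.
    return Math.max(minSize, Math.min(maxSize, blockSize));
}

This is why, with default settings, the number of map tasks usually equals the number of HDFS blocks of the input.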

With the HDFS file system, we all know that a file is stored as blocks spread across many machines, which is what allows very large files to be stored. So how does a mapper find out which machines store the blocks of an HDFS file, and which data each block holds?


The answer is the InputFormat interface: input format classes implement it to provide the logical partitioning of the input.

The job configuration provides a method called setInputFormat() (setInputFormatClass() in the new API), through which we tell the JobTracker which InputFormat class to use. If we do not set it, Hadoop defaults to TextInputFormat, which by default generates one InputSplit for each HDFS block of the file. So when using Hadoop we can also write our own input format, freely choosing the algorithm that splits the input, and even processing data stored outside of HDFS.
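For illustration only, here is a minimal sketch of a custom input format (the class name is made up for this example). It reuses TextInputFormat's record reading but declares files non-splittable, which is one common reason to write your own InputFormat:

// Hypothetical example: a TextInputFormat variant that never splits a file,
// so each whole file is handled by exactly one mapper.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Returning false makes getSplits() produce one split per file
        // instead of one split per HDFS block.
        return false;
    }
}

It would then be plugged in with job.setInputFormatClass(WholeFileTextInputFormat.class) in the new API, or with setInputFormat() on the job configuration in the old API.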

 

The JobTracker tries to schedule each mapper on a machine close to the data it has to process, so that the mapper can read the data locally and save network transfer time. How is this achieved?

For each map task, the split records the hosts where its data is stored, so we schedule the mapper on one of those hosts, or at least on a host that is relatively close. You may ask: the hosts recorded in the split are the HDFS hosts storing the data; what do they have to do with the MapReduce hosts? To make data locality possible, MapReduce and HDFS are in fact usually deployed on the same set of hosts.

 

Since one InputSplit corresponds to one map task, once the map task receives the location information of the data it should process, it can read the data from HDFS.

 

 

Next, let's look at the Input from the point of view of the map function.

The map function takes a key/value pair as its input.

In fact, Hadoop divides each mapper's input data once more, into individual key/value pairs, and calls the map function once for each pair. For this step Hadoop uses yet another class: RecordReader. Its main method, next(), reads key/value pairs out of the InputSplit.

Each InputFormat class may define its own RecordReader. When we tell Hadoop the InputFormat class name through setInputFormat(), the corresponding RecordReader definition is passed along with it.
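To see how the RecordReader drives the map calls, here is a simplified sketch of the loop the framework runs for each map task (modeled on the old-API MapRunner; details such as counters and error handling are omitted):

// Simplified sketch of the per-map-task loop (old API, org.apache.hadoop.mapred):
// the RecordReader pulls key/value pairs out of the InputSplit, and the user's
// map() function is called once per pair.
import java.io.IOException;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class MapLoopSketch {
    static <K1, V1, K2, V2> void runMap(RecordReader<K1, V1> input,
                                        Mapper<K1, V1, K2, V2> mapper,
                                        OutputCollector<K2, V2> output,
                                        Reporter reporter) throws IOException {
        K1 key = input.createKey();
        V1 value = input.createValue();
        while (input.next(key, value)) {               // read the next pair from the split
            mapper.map(key, value, output, reporter);  // call the user-defined map
        }
    }
}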

 

To sum up the whole Input stage:

1. The JobClient specifies the storage location of the input files.

2. The JobClient uses the InputFormat interface to partition the input logically; by default a file is split along its HDFS blocks.

3. Hadoop further divides each split into key/value pairs.

4. The JobTracker assigns each split to a corresponding mapper, while the RecordReader is responsible for reading the key/value pairs out of the split.

 

Mapper

After computing the input split information, the JobClient has the configuration the computation needs, and this information is stored in a folder that the JobTracker creates specifically for this job; the folder is named after the job ID. The job JAR file is replicated 10 times by default (controlled by the mapred.submit.replication property).

The input split information then tells the JobTracker how many map tasks it should start for this job, among other details.

TaskTrackers report their slots to the JobTracker through heartbeats. Each slot can accept one map task, which lets the JobTracker distribute map tasks evenly across machines; the JobTracker keeps track of every slot of every TaskTracker.

After receiving the job, the JobTracker places it in a job queue to wait for the job scheduler. When the scheduler picks up the job according to its scheduling algorithm, it creates one map task for each input split and assigns the map tasks to TaskTrackers for execution, using slot availability as the criterion for assignment.

The TaskTracker sends heartbeats to the JobTracker from time to time to tell it that it is still running; the heartbeat also carries a lot of information, such as the progress of the current map tasks. When the JobTracker receives the completion message of the job's last task, it sets the job status to "success". When the JobClient next queries the job state, it learns that the job has completed and displays a message to the user.

The map task reads its input key/value pairs through the RecordReader and runs the user-defined map logic. That logic produces another set of key/value pairs and writes them into Hadoop's in-memory buffer. In the buffer, the key/value pairs are sorted by key and assigned to different reduce partitions; once the buffer fills up, its contents are written to a file on the local disk, and that file is called a spill file.
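As a concrete, generic illustration (a word-count style mapper, deliberately unrelated to the HBase bulk-load driver shown earlier), a map function that turns its input key/value pairs into another set of key/value pairs looks roughly like this:

// Minimal word-count style mapper (new API): the input key is the byte offset of
// a line, the input value is the line text; the output key is a word and the
// output value is the count 1.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every word on the line; these pairs go into the
        // in-memory buffer, get sorted and partitioned, and are spilled to
        // local disk as described above.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}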

Shuffle

Shuffle is a stage we do not have to write ourselves, but it is a critical one.

 
 

In the map phase, each map function outputs a set of key/value pairs. In the Shuffle phase, the pairs with the same key must be gathered from all map hosts and combined (the Combiner stage is omitted here), then handed to a reduce host, where they enter the reduce function as its input.

The Partitioner component is responsible for calculating which keys should go to the same reduce.

The default is the HashPartitioner class: it feeds the key into a hash function and uses the result. If two keys have the same hash value, their key/value pairs are placed in the same reduce function. The set of key/value pairs allocated to the same reduce function is called a reduce partition.

The number of distinct values the hash function eventually produces is the number of reduce partitions / reduce functions of the Hadoop job; which host each reduce function is finally assigned to for processing is the JobTracker's responsibility.
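For reference, the logic of the default hash-based partitioning is essentially the following (written here as a custom Partitioner sketch; the real HashPartitioner class ships with Hadoop):

// Sketch of hash-based partitioning: the key's hash code, taken modulo the number
// of reduce tasks, decides which reduce partition a key/value pair belongs to.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the partition index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

A custom partitioner like this would be plugged in with job.setPartitionerClass(HashLikePartitioner.class).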

We know that the map phase can produce multiple spill files. When the map task ends, the spill files are merged together; they are not merged into one undivided file, but are again divided into multiple parts by reduce partition.

When a map task completes successfully, it notifies the TaskTracker responsible for it, and the TaskTracker passes the message on to the JobTracker through the heartbeat. In this way, for each job, the JobTracker knows the association between map outputs and map tasks. The reducer has an internal thread that regularly asks the JobTracker for the locations of map outputs, until it has obtained the locations of all the map outputs it needs to handle.

Another reducer thread copies the map output files over and merges them into larger files. If the map tasks were configured to compress their output, the reduce side also has to decompress the map results. When all the map outputs belonging to a reduce task have been copied onto its host, the reduce task begins the sort.

The sort does not process all the files in one pass; rather it proceeds in a series of rounds. Each round produces a result, and those results are sorted again. The final round does not write its result out, but feeds it directly into reduce. At this point the user-supplied reduce function can be called, and its input is the key/value pairs produced by the map tasks.

Note also that reduce tasks do not start only after all map tasks have ended. Map tasks may finish at different times, so a reduce task does not need to wait for every map task to be over before it starts copying. In fact, each reduce task has several threads dedicated to copying map outputs from the map hosts (the default is 5).

Reduce

 
 

The reduce() function takes a key and the corresponding list of values as input. Following the user's own program logic, it merges the values that share the same key and produces another series of key/value pairs, which are written to HDFS as the final output.
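Continuing the word-count illustration from the Mapper section (again only a sketch, not code from this post), a matching reduce function that merges the values sharing a key looks like this:

// Minimal word-count style reducer (new API): for each key, sum up the list of
// values and write one final (word, total) pair, which the output format writes to HDFS.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        total.set(sum);
        context.write(key, total);   // final output key/value pair
    }
}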

Be sure to note that the above describes the MapReduce 1.0 process; MapReduce has since been upgraded to version 2.0.

 

II. Division of roles in MapReduce 1.x and problems you are likely to run into

  

        1. JobTracker

    The master node, a single point. It is responsible for scheduling all jobs and monitoring the resource load of the whole cluster. -------> Because it manages so many tasks and processes, it easily becomes a single point of failure and a single-point bottleneck.

  2. TaskTracker

    A slave node. It manages the resources of its own node, keeps heartbeat contact with the JobTracker, reports its resources, and obtains tasks.

  3. Client

    The programmer writes Java code according to the requirements, packages it into a jar, and uploads it to the HDFS system. Taking the job as the unit, the client plans how the job's computation is distributed, submits the job resources to HDFS, and finally submits the job to the JobTracker.

 

Problems:

       1. The JobTracker is overloaded, and it is a single point of failure.

  2. Resource management and computation scheduling are tightly coupled, so other computing frameworks can hardly reuse its resource management.

  3. Resources cannot be managed globally across different frameworks.

 

 

 

III. Points to note:

An example of the workflow of components such as the JobTracker and TaskTracker

1. The client is also responsible for the following: after setting parameters such as the number of splits, how many reduce tasks to use, where to read the data from, and where to write the results once the computation is done, it packages the Java code into a jar and submits it to the HDFS system (on the NameNode side) rather than to the JobTracker, because putting it on the JobTracker would create a single-point bottleneck or a single point of failure. So the jar file with the corresponding settings is placed on HDFS, and once it has been submitted there, the JobTracker obtains the concrete task configuration from it (a minimal driver sketch of these settings appears after this list).

2. A TaskTracker is responsible for the usage of resources and processes on the node where it resides.

3. The Java interface code written by the client to set things such as the number of splits and reduces is packaged into a jar and uploaded to the NameNode; every node that already has a TaskTracker process then downloads the corresponding jar file and, following the settings in the jar, starts the appropriate Map Task or Reduce Task processes on that node.
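As a minimal sketch of the client-side settings mentioned above (class name, paths, and numbers are made up for illustration), the driver code that controls split size, the number of reduce tasks, and the input/output locations might look like this:

// Hypothetical driver settings: where to read, where to write, how splits are
// sized, and how many reduce tasks to run. Paths and values are examples only;
// mapper, reducer, and output types would also be set in a real job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExampleDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "example-job");
        job.setJarByClass(ExampleDriver.class);     // the jar is shipped through HDFS, not kept on the JobTracker

        FileInputFormat.addInputPath(job, new Path("/input/data"));        // where to read the data from
        FileOutputFormat.setOutputPath(job, new Path("/output/result"));   // where to write the results

        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);     // influences the number of splits, i.e. map tasks
        job.setNumReduceTasks(4);                                          // the number of reduce tasks / reduce partitions

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}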

 

References:

  https://www.jianshu.com/p/461f86936972
