Big Data Development Technology: How MapReduce Works

MapReduce is a programming model for parallel computation over large data sets (greater than 1 TB). MapReduce applies the idea of "divide and conquer": a large-scale operation on a data set is distributed across sub-nodes under the management of a common master node, and the intermediate results produced by those nodes are then combined into the final result. Put simply, MapReduce is "decomposition of the task and summarization of the results."

In MapReduce 1.x, two kinds of machines run a MapReduce job: the JobTracker, which schedules the work, and the TaskTrackers, which execute it. A Hadoop cluster has only one JobTracker. (In the architecture figure from the original post, each TaskTracker corresponds to an HDFS DataNode.)


Process Analysis

  1. A job is started on the client, and the client requests a job ID from the JobTracker.

  2. The client copies the files required to run the job to HDFS, including the packaged MapReduce JAR file, the configuration files, and the input split information computed by the client. These files are stored in a folder that the JobTracker creates specifically for this job; the folder is named after the job ID.

  3. After the JobTracker receives the job, it places it in a queue to wait for scheduling. When the scheduler picks the job according to its scheduling algorithm, the JobTracker creates N map tasks based on the N input splits and assigns those map tasks to N TaskTrackers (DataNodes) for execution.

  4. A map task is not assigned to an arbitrary TaskTracker. A concept called data locality (data-local) applies here: the map task is assigned to a TaskTracker that contains the data block the map will process, and the JAR package is copied to that TaskTracker to run there. This is summarized as "move the computation, not the data." Data locality is not considered when assigning reduce tasks.

  5. The TaskTracker periodically sends a heartbeat to the JobTracker to report that it is still running; the heartbeat also carries information such as the current progress of the map task. When the JobTracker receives the completion message for the job's last task, it sets the job to "successful." When the JobClient next queries the job state, it learns that the job has completed and displays a message to the user. A minimal driver performing the submission described in steps 1 and 2 is sketched below.
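To make the submission steps concrete, here is a minimal WordCount driver. This is a sketch assuming Hadoop's classic Job API and the WCMapper and WCReducer classes shown later in this article; class and path names are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WCDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "wordcount");   // step 1: a job ID is requested
            job.setJarByClass(WCDriver.class);      // step 2: this JAR is copied to HDFS
            job.setMapperClass(WCMapper.class);
            job.setReducerClass(WCReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input to be split
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
            System.exit(job.waitForCompletion(true) ? 0 : 1);       // step 5: poll until done
        }
    }

waitForCompletion(true) polls the job's progress and prints it, which corresponds to the JobClient status queries described in step 5.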

The above analyzes how MapReduce works at the level of the client, the JobTracker, and the TaskTracker. Let's now analyze it more carefully at the level of the map task and the reduce task.


The MapReduce execution flow

Taking WordCount as an example (for an input file containing "hello world hello", the final output is "hello 2" and "world 1"), the detailed execution flow is as follows.

1. Split stage

First, MapReduce splits the large file to be processed. Each input split corresponds to one map task. An input split does not store the data itself; it stores the length of a piece of the data and an array of the locations where that data resides. Input splits are closely related to HDFS blocks: if the HDFS block size is set to 64 MB and we run a large file of 64 MB x 10, MapReduce divides the work into 10 map tasks, and each map task is run on a DataNode holding the block it is to process.
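How many map tasks a file produces follows from the split size. Here is a simplified sketch of the calculation Hadoop's FileInputFormat performs (variable names are illustrative; by default the min and max bounds leave the split size equal to the block size):

    // Split size is the block size, clamped between a configurable min and max.
    long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }
    // Example: 64 MB blocks, a 640 MB file -> 640 / 64 = 10 splits = 10 map tasks.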

2. Map stage

The map stage runs the map function written by the programmer, so the efficiency of the map function is relatively easy to control, and map operations are generally localized, i.e. performed on the nodes where the data is stored. The map function for this example is as follows:

    import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.util.StringUtils;

    public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String str = value.toString();
            String[] strs = StringUtils.split(str, ' '); // split the line on spaces
            for (String s : strs) {
                context.write(new Text(s), new IntWritable(1)); // emit (word, 1)
            }
        }
    }

The line is split into words on spaces and each word is counted as 1: the map generates a (word, 1) key/value pair so that the number of occurrences can be summed up later. For the input line "hello world hello", for example, the mapper emits (hello, 1), (world, 1), (hello, 1).

3. Shuffle stage

The shuffle stage is mainly responsible for transferring the data generated on the map side to the reduce side, so the shuffle is divided into a map-side part and a reduce-side part.

First, the map side:

  1. First, determine which partition each record of map output data belongs to. One partition corresponds to one reducer; the partition is generally computed as key.hashCode() % (number of reducers), as the sketch after this list shows.

  2. The map output is written into a memory buffer; when the buffer reaches the 80% threshold, a spill to disk begins, during which the keys are sorted. If there is a combiner step, records with the same key are merged. Finally, the multiple spill files are merged into a single file.
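The partitioning in step 1 is what Hadoop's default HashPartitioner does; a minimal equivalent sketch (the class name WCPartitioner is illustrative):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class WCPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Mask with Integer.MAX_VALUE so the result is non-negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

In MapReduce 1.x, the buffer size and the 80% spill threshold from step 2 are governed by the io.sort.mb and io.sort.spill.percent properties.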

Then, the reduce side:

Each reduce node pulls the data stored on disk by the various map nodes into its own memory buffer. In the same way, the data from the individual maps is merged and spilled to disk; finally, the data on disk and the remaining roughly 20% left in the buffer are merged and passed to the reduce stage.

4. Reduce stage

The reduce stage performs the final consolidation and merging of the data delivered by the shuffle stage:

    import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable i : values) {
                sum += i.get(); // add up the counts for this word
            }
            context.write(key, new IntWritable(sum)); // emit (word, total)
        }
    }
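Tracing the whole pipeline on the earlier example: for an input file containing "hello world hello", the map stage emits (hello, 1), (world, 1), (hello, 1); the shuffle stage sorts and groups these into (hello, [1, 1]) and (world, [1]); and the reduce stage sums each group, writing (hello, 2) and (world, 1). Because this summation is associative, the same WCReducer class could also serve as the combiner mentioned in the shuffle stage (registered with job.setCombinerClass in a driver like the sketch above), pre-aggregating counts during the map-side spill.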

Advantages and disadvantages of MapReduce

Advantages:

  1. Easy to program;

  2. Good scalability;

  3. High fault tolerance;

  4. Well suited to distributed offline batch processing of big data at the PB scale and above.

Disadvantages:

  1. Hard to use for real-time computation (MapReduce processes offline data stored on local disks);

  2. Cannot do stream computation (the data sources MapReduce is designed to process are static);

  3. Hard to express DAG computations. MapReduce-style parallel computation is mostly based on an acyclic data-flow model: within a single computation, the different compute nodes run with a high degree of parallelism, but this data-flow model prevents iterative algorithms that need to reuse a particular data set repeatedly from running efficiently.



Origin: blog.csdn.net/kangshifu66/article/details/93774534