Hadoop MapReduce: an analysis from the perspective of data flow

This article walks through the MapReduce process from the perspective of how data flows through it.

JobTracker: responsible for scheduling tasks and monitoring cluster resources.

TaskTracker: responsible for executing tasks and reporting heartbeats to the JobTracker.

1. Input

The MapReduce framework first slices the file to be processed on HDFS into input splits (InputSplit); the FileInputFormat class is a subclass of the InputFormat class. Each InputSplit serves as the input of one map task and is parsed into key-value pairs, which become the input of the map function. For most text input, the key defaults to the starting offset of the line (loosely, the line number) and the value to the content of that line. An InputSplit may be larger or smaller than an HDFS data block.
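As a minimal sketch (the input path and job name are assumptions), this is how a job is pointed at an HDFS file using the default TextInputFormat, which produces the (offset, line) pairs described above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "input-demo");
    // TextInputFormat (the default) slices the file into InputSplits and
    // parses each split into (LongWritable offset, Text line) pairs
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path("/input/data.txt")); // assumed path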

2. Map and the intermediate output

Each map task has a circular in-memory buffer that stores the output of the map function. When the content of the buffer reaches a threshold, a background thread spills the contents of the buffer to disk.

(1) Before being written to disk, the data in the buffer is partitioned (Partition) according to the reducer it will eventually be sent to; by default the partition is determined by the key. Within each partition, a background thread sorts the records by key (the first sort, done with quicksort).

(2) Before the data is written to disk, a Combiner can be applied to combine the values that share the same key. The goal is to make the intermediate map output more compact, so that less data is written to local disk and transmitted to the reducers. Combiners are suitable for operations such as taking a maximum or minimum and summing; a configuration sketch follows this list.

(3) The second sort: the multiple spill files are merged with a merge sort. The final merged result is a single output file that is already partitioned and sorted.
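A minimal configuration sketch, assuming Job/Configuration objects like the ones in the driver shown in section 6 and a sum-style reducer (the IntSumReducer from the word-count example below), so the combined operation is commutative and associative:

    // size of the circular in-memory sort buffer, in MB (Hadoop 2.x default: 100)
    conf.set("mapreduce.task.io.sort.mb", "200");
    // buffer fill fraction that triggers a background spill (default: 0.80)
    conf.set("mapreduce.map.sort.spill.percent", "0.80");
    // reuse the reducer as a combiner: safe here because summing is
    // commutative and associative
    job.setCombinerClass(IntSumReducer.class);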

3. Shuffle (also called the data shuffle)

(1) Copy: this is a pull operation; the TaskTracker where a reduce task runs copies the intermediate map results from the nodes where the map tasks ran. During this copy process, spill results belonging to the same partition are pulled and copied to the same reducer.

When the number of reducers equals the number of partitions, each reducer receives the data of exactly one partition.

When the number of reducers is less than the number of partitions, a single reducer will receive the data of two (or more) partitions.

When the number of reducers is greater than the number of partitions, some reduce tasks will produce empty results.
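These three cases follow from how partitions are assigned. For reference, the core of Hadoop's default HashPartitioner maps every key to a partition by its hash, so identical keys always reach the same reducer:

    // from org.apache.hadoop.mapreduce.lib.partition.HashPartitioner:
    // same key -> same hash -> same partition -> same reducer
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
      return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }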


(2) The third sort: the intermediate results of the multiple map tasks are merged, again using merge sort.

When all the map outputs have been copied, all the data is finally merged into one sorted file, which becomes the input of the reduce task. After the shuffle, each file is partitioned and sorted by key, but there is no ordering relationship between the files of different reducers (see the secondary-sort example).

4. Reduce: the merged file of each partition is processed by the reduce function (the reduce method is called once per key).

5. Example: word count

(1) Each map task takes one InputSplit as its input; the InputSplit is parsed into key-value pairs for the map function.

(2) There are usually multiple map tasks; the output of each map task is a file that is already partitioned and sorted.

(3) The shuffle automatically merges and sorts the outputs of the multiple map tasks; the result is a merged file per partition, sorted by key.

(4) After the shuffle there are multiple files, each corresponding to one reduce task; for every key in a partition, all of that key's values are passed together as the input of the reduce function. The two classes are sketched below.
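A minimal sketch of the two classes, following the classic Hadoop word-count example (the class and field names here are illustrative):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // map: (line offset, line) -> (word, 1) for every word in the line
    public class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(line.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, ONE); // one (word, 1) pair per occurrence
        }
      }
    }

    // reduce: called once per key, with all of that key's values
    public class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
          sum += v.get();
        }
        result.set(sum);
        context.write(key, result); // (word, total count)
      }
    }

The Text and IntWritable fields are reused across calls to avoid allocating a new object for every record.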


6. Summary of key points:

(1) A map operation outputs its intermediate results to local disk; a reduce operation stores its results in HDFS.

(2) A map task processes one InputSplit; a reduce task processes the intermediate results of the same partition from all map tasks.

(3) The merge sorts actually sort the data belonging to the same partition.

(4) The Mapper class takes the line number (offset) and the content of each line as input, and its map method is called once per line. The Reducer class takes the data of one partition as input, and its reduce method is called once per key in that partition (see the table-join example).

(5) The number of map tasks (Mappers) depends on the number of input splits (InputSplits). The number of reduce tasks (Reducers) defaults to 1 and must be set in code for each job, e.g. job.setNumReduceTasks(50).
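Putting these points together, a minimal driver sketch (the paths come from the command line and the job name is an assumption; the Mapper and Reducer are the word-count classes from section 5):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional, see section 2
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // the framework decides one map task per InputSplit;
        // the number of reduce tasks defaults to 1 unless set explicitly
        job.setNumReduceTasks(50);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }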


7. Controlling the shuffle

(1) Write your own Partitioner.

Partitioner is also a generic class: since the partition operation runs after the map function has processed the data but before it is written to disk, its generic type parameters describe the types of the map function's output. The getPartition method operates on the raw output of the map function; the key and value it receives are exactly what the map function emitted. Hadoop's default partitioning is based on the key (keys that are the same end up in the same partition).
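A sketch of a custom Partitioner under these constraints (the routing rule and class name are illustrative, and Text/IntWritable map output types are assumed):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // the generic parameters match the map output types (key, value)
    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        // hypothetical rule: route each key by its first character,
        // so keys starting with the same letter reach the same reducer
        String s = key.toString();
        if (s.isEmpty()) {
          return 0;
        }
        return Character.toLowerCase(s.charAt(0)) % numPartitions;
      }
    }
    // registered in the driver with:
    // job.setPartitionerClass(FirstLetterPartitioner.class);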

(2) Control the sorting rule. The sort is always executed by the MapReduce framework, but we can control the rule it uses.

Hadoop's default sort compares the keys of the key-value pairs using their compare/compareTo methods.
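A sketch of overriding that rule, here reversing the default Text order by negating the built-in comparison (the class name is illustrative):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    public class ReverseTextComparator extends WritableComparator {
      protected ReverseTextComparator() {
        super(Text.class, true); // true: create key instances for compare()
      }

      @Override
      @SuppressWarnings("rawtypes")
      public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b); // negate the default order: descending
      }
    }
    // registered in the driver with:
    // job.setSortComparatorClass(ReverseTextComparator.class);

The same mechanism, via job.setGroupingComparatorClass, controls how keys are grouped for the reduce call, which is the basis of the secondary-sort technique mentioned above.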


Origin: blog.csdn.net/xuehuagongzi000/article/details/76736979