MapTask-ReduceTask process

MapTask

  • Read stage: the MapTask uses the user-written RecordReader to parse individual key/value pairs out of the input InputSplit

  • Map stage: the parsed key/value pairs are handed to the user's map() function, which produces a new set of key/value pairs

  • Collect stage: inside the user's map() function, when a record has been processed, OutputCollector.collect() is usually called to emit the result. Inside that call, the generated key/value pair is assigned a partition (the Partitioner is invoked) and written into an in-memory ring buffer (see the map-side sketch after this list)

  • Spill stage: when the ring buffer fills up, MapReduce writes its data to local disk as a temporary spill file. Note that before the data reaches disk, it is first sorted locally and, where applicable, combined and compressed. The spill proceeds in three steps:

    • Step 1: sort the data in the buffer with quicksort. Records are ordered first by partition number and then by key, so that after sorting the data forms contiguous partitions and, within each partition, is ordered by key

    • Step 2: write each partition's data, in increasing order of partition number, to a temporary file under the task's working directory. If the user has configured a Combiner, the data in each partition is aggregated once before it is written

    • Step 3: write each partition's metadata to the in-memory SpillRecord index structure; the metadata for a partition includes its offset within the temporary file, its size before compression, and its size after compression

  • Combine stage: once all input has been processed, the MapTask merges all of its temporary spill files in one pass, guaranteeing that it ultimately produces a single output data file
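
To make the map-side flow concrete, here is a minimal sketch in the style of the classic word-count job. The class names WordCountMapper and FirstLetterPartitioner are hypothetical (not from the original post), and context.write() in the newer org.apache.hadoop.mapreduce API plays the role of the OutputCollector.collect() call described in the Collect stage:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical word-count Mapper; names are illustrative only.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    // Read stage: by the time map() runs, the RecordReader has already
    // parsed the InputSplit into (byte offset, line) pairs.
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Map stage: turn one input key/value into new key/value pairs.
        for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            // Collect stage: this call runs the Partitioner and puts the
            // pair into the in-memory ring buffer described above.
            context.write(word, ONE);
        }
    }
}

// Optional custom Partitioner: decides which partition (and hence which
// ReduceTask) a pair is assigned to during the Collect stage.
class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Non-negative hash of the first character, modulo partition count.
        String s = key.toString();
        int c = s.isEmpty() ? 0 : s.charAt(0);
        return (c & Integer.MAX_VALUE) % numPartitions;
    }
}
```

In current Hadoop releases the spill behaviour described above is also tunable: mapreduce.task.io.sort.mb sets the ring buffer size, mapreduce.map.sort.spill.percent sets the fill ratio that triggers a spill, and mapreduce.task.io.sort.factor controls how many files are merged in one pass during the final Combine stage.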

ReduceTask

  • Copy stage: the ReduceTask remotely copies a piece of data from each MapTask; if a piece exceeds a size threshold it is written to disk, otherwise it is placed directly in memory

  • Merge stage: while data is still being copied, the ReduceTask runs two background threads that merge the files accumulating in memory and on disk, preventing excessive memory use and a proliferation of disk files

  • Sort stage: by MapReduce semantics, the input to the user's reduce() function is a set of data grouped by key, and Hadoop uses a sort-based strategy to bring records with the same key together. Because every MapTask has already locally sorted its own output, the ReduceTask only needs to perform one merge sort over all the copied data

  • Reduce stage: the reduce() function writes its results to HDFS (see the reduce-side sketch after this list)
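
A matching reduce-side sketch under the same word-count assumption; WordCountReducer is again a hypothetical name, and the driver shows where the Combiner and the custom Partitioner from the map-side sketch would be registered:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical word-count Reducer; names are illustrative only.
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    // The Sort stage has already merge-sorted the copied map outputs, so
    // reduce() receives each key exactly once with all its values grouped.
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // Reduce stage: emitted pairs are written to the job's output
        // directory on HDFS by the configured OutputFormat.
        context.write(key, new IntWritable(sum));
    }

    // Minimal driver wiring the pieces together.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountReducer.class);
        job.setMapperClass(WordCountMapper.class);
        // Map-side aggregation applied to each partition before spills.
        job.setCombinerClass(WordCountReducer.class);
        job.setPartitionerClass(FirstLetterPartitioner.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

As a note on the Copy stage: in current Hadoop releases, mapreduce.reduce.shuffle.input.buffer.percent bounds the fraction of the reducer's heap that may hold copied map outputs before they are spilled to disk.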

Source: www.cnblogs.com/xiangyuguan/p/11316422.html