Diagram of the ReduceTask working mechanism

[Figure: diagram of the ReduceTask working mechanism]

(1) Copy stage: the ReduceTask remotely copies one piece of output data from each MapTask. If a piece exceeds a certain size threshold, it is written to disk; otherwise it is kept directly in memory, as sketched below.
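
As a rough illustration, this sketch shows the per-map-output decision the copy phase makes. This is not Hadoop's actual code: the class, field, and method names are made up for this example, and in real Hadoop the in-memory limit is derived from settings such as mapreduce.reduce.shuffle.memory.limit.percent.

```java
// Hypothetical sketch of the copy-phase decision; not Hadoop's real classes.
public class ShuffleCopySketch {
    // Largest map output allowed to stay in memory (illustrative).
    private final long maxSingleShuffleLimit;

    public ShuffleCopySketch(long shuffleHeapBytes, double memoryLimitPercent) {
        this.maxSingleShuffleLimit = (long) (shuffleHeapBytes * memoryLimitPercent);
    }

    /** True if this fetched map output should be written straight to disk. */
    public boolean shouldSpillToDisk(long mapOutputSize) {
        return mapOutputSize > maxSingleShuffleLimit;
    }
}
```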

(2) Merge stage: while data is being copied remotely, the ReduceTask starts two background threads that merge the files accumulated in memory and on disk, so that memory usage does not grow too high and the number of files on disk does not grow too large.
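
The merge behavior can be tuned through the job configuration. The property keys below are real Hadoop configuration names; the values shown are simply their common defaults, given here for illustration.

```java
import org.apache.hadoop.conf.Configuration;

public class MergeTuning {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Start the in-memory merge once the shuffle buffer is 66% full.
        conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);
        // Merge at most 10 segments (streams) in one merge pass.
        conf.setInt("mapreduce.task.io.sort.factor", 10);
        return conf;
    }
}
```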

(3) Sort stage: according to MapReduce semantics, the input to the user-written reduce() function is a set of records grouped by key. To bring all records with the same key together, Hadoop uses a sort-based strategy. Since every MapTask has already partially sorted its own output, the ReduceTask only needs to perform a single merge sort over all the data.
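
That final pass is just a k-way merge of already-sorted runs. The self-contained sketch below (plain Java, not Hadoop code) shows the idea, with strings standing in for map output keys.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// K-way merge of sorted runs, the core idea of the ReduceTask sort stage.
public class KWayMergeSketch {
    public static List<String> merge(List<List<String>> sortedRuns) {
        // Min-heap of {run index, position within run}, ordered by current value.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                Comparator.comparing((int[] e) -> sortedRuns.get(e[0]).get(e[1])));
        for (int i = 0; i < sortedRuns.size(); i++) {
            if (!sortedRuns.get(i).isEmpty()) heap.add(new int[]{i, 0});
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            out.add(sortedRuns.get(top[0]).get(top[1]));
            if (top[1] + 1 < sortedRuns.get(top[0]).size()) {
                heap.add(new int[]{top[0], top[1] + 1});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Each inner list is already sorted, like a single MapTask's output.
        System.out.println(merge(List.of(
                List.of("apple", "cat"), List.of("bee", "dog"), List.of("ant"))));
        // Prints: [ant, apple, bee, cat, dog]
    }
}
```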

(4) Reduce stage: the reduce() function is invoked once per group of values sharing a key, and its results are written to HDFS.
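
For concreteness, a typical reduce() implementation looks like the classic word-count reducer below. It uses the standard org.apache.hadoop.mapreduce.Reducer API; the framework's output format writes each ReduceTask's results to HDFS (for example as part-r-00000).

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Called once per key, with all of that key's values gathered by the sort stage.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result); // persisted to HDFS by the output format
    }
}
```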

Source: blog.csdn.net/weixin_46457946/article/details/114287642