Detailed explanation of the working principle of MapReduce

As a beginner who has just started learning big data, I would like to share what I have learned with everyone.

The diagram below, which I drew myself with each step labeled, illustrates the working principle of MapReduce in detail.

[Figure: the MapReduce Shuffle workflow, with each step labeled]

The specific Shuffle process is as follows:

1) MapTask collects the key-value pairs emitted by our map() method and places them in an in-memory buffer.
2) When the buffer fills, its contents are spilled to local disk files repeatedly, which may produce multiple spill files.
3) The multiple spill files are merged into one large spill file.
4) During both the spill and the merge, the Partitioner is called to partition the data, and records are sorted by key (see the sketch after this list).
5) Each ReduceTask, according to its partition number, fetches the corresponding partition of result data from every MapTask's machine.
6) The ReduceTask then merges the result files it has fetched from different MapTasks for the same partition (merge sort).
7) After the merge into one large file, the Shuffle process is over; the ReduceTask then carries out its logical processing, reading key-value pairs from the file one group at a time and calling the user-defined reduce() method on each group.
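To connect these steps to actual code, here is a minimal WordCount-style sketch in Java. The class names, the whitespace tokenizer, and the first-letter partitioning rule are illustrative assumptions of mine, not from the original post. The map() output corresponds to step 1, the Partitioner to step 4, and the per-group reduce() call to step 7:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

// Step 1: map() emits the kv pairs that Shuffle will buffer, spill, and sort.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text word = new Text();
    private final IntWritable one = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, one); // collected into the in-memory buffer
        }
    }
}

// Step 4: the Partitioner decides which ReduceTask each key belongs to.
// Partitioning by the key's first character is a made-up rule for illustration.
class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        return (key.toString().charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}

// Step 7: after the merge sort, reduce() is called once per key group.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```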

Notes:

1) The size of the buffer used in Shuffle affects the execution efficiency of a MapReduce program: in principle, the larger the buffer, the fewer disk I/O operations are needed and the faster the job runs.
2) The buffer size can be adjusted with a parameter: io.sort.mb, which defaults to 100 MB.
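As a hedged sketch of note 2, the buffer size could be set in job code like this. I am assuming Hadoop 2.x or later, where the io.sort.mb key was renamed mapreduce.task.io.sort.mb; the job name and the 200 MB value are arbitrary examples:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleBufferConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Raise the map-side sort buffer from the 100 MB default to 200 MB.
        // "mapreduce.task.io.sort.mb" is the Hadoop 2.x+ name; older
        // releases used the "io.sort.mb" key mentioned in the note above.
        conf.setInt("mapreduce.task.io.sort.mb", 200);

        Job job = Job.getInstance(conf, "shuffle-buffer-demo");
        // ... set mapper/reducer/input/output as usual, then submit:
        // job.waitForCompletion(true);
    }
}
```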

Thanks for reading; please point out anything I got wrong.
Previous: HDFS common API operations https://blog.csdn.net/qq_40169189/article/details/105546278
