MapReduce Framework Notes - Detailed Shuffle Process

0x0 Background

MapReduce is a computing framework that ships with Hadoop. Although most projects no longer use it for computation (memory-based frameworks such as Spark are more efficient), its principles are still worth studying. The core of the MapReduce framework is the shuffle process, so these notes record my understanding of shuffle.

0x1 Map-Side Shuffle

First, look at the figure from Hadoop: The Definitive Guide:

For a map task, you can see the following steps:
1. The input data is split into chunks, and each chunk (input split) becomes the input to one map task.
2. The map task writes its output to an in-memory buffer (a circular buffer, 100 MB by default).
3. Before the results are spilled to disk, three important processing steps take place:
(1) Partitioning (by default, based on the hash of the key modulo the number of reducers)
(2) Sorting (within each partition, the data is sorted by key)
(3) Combining (if the user has defined a combiner, it is executed here)
4. Once the data in the buffer reaches a threshold (80% of the buffer size by default), the buffered data is spilled to disk, and each spill produces one spill file.
5. Finally, the spill files are merged into a single output file. If the map task generated at least three spill files, the combiner is executed again during this merge.
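The default partitioning rule in step 3(1) can be sketched as follows. This is modeled on Hadoop's HashPartitioner; the class and key names here are illustrative, and a real partitioner would operate on the job's serialized key type rather than a plain String.

```java
// Sketch of default hash partitioning (modeled on Hadoop's HashPartitioner).
public class HashPartitionSketch {
    // Masking with Integer.MAX_VALUE clears the sign bit, so a negative
    // hashCode() still yields a partition index in [0, numReduceTasks).
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;
        for (String key : new String[]{"apple", "banana", "cherry"}) {
            System.out.println(key + " -> partition " + getPartition(key, reducers));
        }
    }
}
```

Because the same key always hashes to the same partition, every record with a given key ends up at the same reducer, which is what makes the per-key grouping on the reduce side possible.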

0x2 Reduce-Side Shuffle

Again, look at the figure first:
As the map-side figure shows, the output file produced by each map is divided into N partitions, one for each reducer.
As soon as a map task completes, each reducer copies its corresponding partition from that map's output. A background thread on the reducer side then merges these files into one (when there are many files, this often takes several rounds of merging), sorting as it merges. So how does a reducer know when, and from which node, to fetch its partition? The original text gives the answer:
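The merging described above is essentially a k-way merge of already-sorted runs, which is the core idea behind both the map-side spill merge and the reduce-side merge. Below is a minimal in-memory sketch of one merge round using a priority queue over the head of each run; all names are illustrative, and the real Hadoop merger also handles on-disk segments and a configurable merge factor.

```java
import java.util.*;

// Illustrative k-way merge of sorted runs, as used when merging
// spill files (map side) or copied map outputs (reduce side).
public class KWayMergeSketch {
    static List<Integer> merge(List<List<Integer>> runs) {
        // Heap entries: {value, runIndex, positionInRun}, ordered by value.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[0]));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) {
                heap.add(new int[]{runs.get(r).get(0), r, 0});
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            out.add(top[0]);
            int next = top[2] + 1; // advance within the run we just consumed
            if (next < runs.get(top[1]).size()) {
                heap.add(new int[]{runs.get(top[1]).get(next), top[1], next});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> runs = Arrays.asList(
            Arrays.asList(1, 4, 7), Arrays.asList(2, 5), Arrays.asList(3, 6, 8));
        System.out.println(merge(runs)); // prints [1, 2, 3, 4, 5, 6, 7, 8]
    }
}
```

Because each run is already sorted, only the head of each run needs to be compared at any moment, so merging k runs of n total records costs O(n log k) rather than a full re-sort.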

How do reducers know which machines to fetch map output from?
As map tasks complete successfully, they notify their application master using the heartbeat mechanism. Therefore,
for a given job, the application master knows the mapping between map outputs and hosts. A thread in the reducer
periodically asks the master for map output hosts until it has retrieved them all.
