A brief description of the Shuffle process in MapReduce

Preface

Although, when writing a MapReduce program, we only need to focus on writing the map function on the Map side and the reduce function on the Reduce side, the Shuffle process is the core of the MapReduce workflow, and understanding the Shuffle process is the key to understanding how MapReduce works.

The overall workflow of MapReduce is shown in the figure below.

[Figure: overview of the MapReduce workflow (Map, Shuffle, Reduce)]

From the figure we can see that the MapReduce workflow is divided into Map, Shuffle, and Reduce. The Shuffle process spans both the Map side and the Reduce side, while the map task itself on the Map side and the reduce task itself on the Reduce side are not part of the Shuffle process.

The detailed flow of the Shuffle process is shown in the figure below and is explained step by step in what follows.

[Figure: detailed view of the Shuffle process on the Map side and the Reduce side]

The Shuffle process on the Map side

On the Map side, the data produced by the map task, a series of <k, v> key-value pairs, first goes into an in-memory buffer. When the buffered data reaches a certain threshold (the spill ratio of the buffer space), a spill (overflow write) is triggered on the Map side, which writes the map output to disk. Before the data is spilled to disk, it goes through the Map-side Shuffle processing steps: partitioning, sorting, and combining (the combine step is optional and must be defined by the user; to keep this description brief it is not covered here). After the spills reach disk, the multiple spill files go through a merge step.
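
For intuition, here is a minimal, self-contained sketch of the spill-trigger idea, assuming an illustrative buffer capacity and spill ratio (in classic Hadoop these roughly correspond to the io.sort.mb and io.sort.spill.percent settings); it is not the actual Hadoop implementation:

```java
// Minimal sketch of when a spill is triggered: once the in-memory buffer
// holds more than (capacity * spillRatio) of data, its contents are
// partitioned, sorted, optionally combined, and written out as a spill file.
public class SpillTriggerSketch {
    public static void main(String[] args) {
        final int bufferCapacity = 100; // illustrative units, e.g. MB
        final double spillRatio = 0.8;  // spill once the buffer is 80% full
        int buffered = 0;

        for (int i = 1; i <= 10; i++) {
            buffered += 12; // pretend each map() call adds 12 units of output
            if (buffered >= bufferCapacity * spillRatio) {
                System.out.println("buffer at " + buffered + ": spill to disk");
                buffered = 0; // the spill empties the buffer
            }
        }
    }
}
```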

It is worth mentioning that the input and output of MapReduce are stored in the distributed file system (HDFS), while the intermediate results are stored on the local disk (which is also one of the weak points of MapReduce in terms of speed), so the Shuffle process involves a large amount of local disk I/O.

Partition data

For each key-value pair <k, v> that the map task writes into the buffer, the key is run through a hash function and the result is taken modulo the number of reduce tasks. In this way the key-value pairs are divided into multiple partitions (the number of partitions equals the number of reduce tasks), and the data in each partition is handed to the corresponding reduce task (one partition of a spill file corresponds to one reduce task), which is what enables parallel computation.
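
To make the partitioning rule concrete, here is a minimal standalone sketch of the hash-then-modulo idea (it mirrors what Hadoop's default HashPartitioner does, but the class name and data here are invented for illustration):

```java
// Minimal sketch of hash-modulo partitioning: the partition index is
// derived from the key's hash code, taken modulo the number of reduce tasks.
public class SimpleHashPartitioner {

    // Returns the partition (0 .. numReduceTasks - 1) a given key belongs to.
    // Masking with Integer.MAX_VALUE keeps the hash value non-negative.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(key + " -> partition " + getPartition(key, numReduceTasks));
        }
    }
}
```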

Sort data

After the data has been partitioned, all the key-value pairs within each partition are sorted by key.

After the buffered data has been partitioned and sorted, it is spilled to disk. Each spill generates a spill file and clears the corresponding data from the buffer.
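
Putting the two steps together, the sketch below shows buffered pairs being grouped by partition and sorted by key, which is the shape the data has just before it is written out as a spill file (the class name and data are invented for illustration):

```java
import java.util.*;

// Minimal sketch of the partition-then-sort step before a spill: buffered
// <key, value> pairs are grouped by partition index and, within each
// partition, sorted by key before being written out as one spill file.
public class SpillSortSketch {

    public static void main(String[] args) {
        int numReduceTasks = 2;
        List<Map.Entry<String, String>> buffer = List.of(
                Map.entry("banana", "1"), Map.entry("apple", "1"), Map.entry("cherry", "1"));

        // Group buffered pairs by partition (hash of the key mod #reducers).
        Map<Integer, List<Map.Entry<String, String>>> partitions = new TreeMap<>();
        for (Map.Entry<String, String> kv : buffer) {
            int part = (kv.getKey().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
            partitions.computeIfAbsent(part, p -> new ArrayList<>()).add(kv);
        }

        // Sort each partition by key; a real spill would now write this data
        // sequentially to a spill file, partition by partition.
        partitions.forEach((part, pairs) -> {
            pairs.sort(Map.Entry.comparingByKey());
            System.out.println("partition " + part + ": " + pairs);
        });
    }
}
```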

Merge files

Each spill produces one spill file, so as the map task keeps running, more and more spill files pile up on disk. Therefore, after the map task finishes, the spill files on disk are merged, combining the multiple spill files into one large spill file.
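
Because every spill file is already sorted by key within each partition, the merge can be done as a k-way merge of sorted runs. The sketch below merges sorted in-memory lists that stand in for spill files; it only illustrates the idea and is not the actual Hadoop merger:

```java
import java.util.*;

// Minimal sketch of merging several sorted spill files into one sorted
// output with a k-way merge driven by a priority queue. Each inner list
// stands in for one sorted spill file (one partition's worth of data).
public class SpillMergeSketch {

    public static void main(String[] args) {
        List<List<String>> spills = List.of(
                List.of("apple", "cherry"),
                List.of("banana", "date"),
                List.of("apple", "banana"));

        // Each heap entry is {spill index, position within that spill},
        // ordered by the key found at that position.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                Comparator.comparing((int[] e) -> spills.get(e[0]).get(e[1])));
        for (int i = 0; i < spills.size(); i++) {
            heap.add(new int[] {i, 0});
        }

        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            merged.add(spills.get(top[0]).get(top[1]));
            if (top[1] + 1 < spills.get(top[0]).size()) {
                heap.add(new int[] {top[0], top[1] + 1}); // advance in that spill
            }
        }
        System.out.println(merged); // one large, sorted "spill file"
    }
}
```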

The Shuffle process on the Reduce side

Receive data

When all the map tasks have completed, the Reduce side receives a notification to go to the Map side and fetch the partition data that belongs to it. Since there are multiple map outputs, the Reduce side naturally spawns multiple threads to fetch data from the different Map-side outputs in parallel.
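
The parallel fetch can be pictured roughly as in the sketch below, where a small thread pool stands in for the copier threads and the map outputs are faked as in-memory lists (in a real cluster this data is pulled over the network from each completed map task):

```java
import java.util.*;
import java.util.concurrent.*;

// Minimal sketch of the reduce-side fetch: one reduce task uses a pool of
// threads to pull its own partition's data from every map output in
// parallel. Here the map outputs are faked as lists keyed by map task id.
public class FetchSketch {

    public static void main(String[] args) throws Exception {
        // The partition data belonging to this reduce task, per map task.
        Map<String, List<String>> mapOutputs = Map.of(
                "map-0", List.of("apple\t1", "banana\t1"),
                "map-1", List.of("apple\t1", "cherry\t1"));

        ExecutorService copiers = Executors.newFixedThreadPool(2);
        List<Future<List<String>>> fetches = new ArrayList<>();
        for (String mapTask : mapOutputs.keySet()) {
            fetches.add(copiers.submit(() -> mapOutputs.get(mapTask)));
        }

        List<String> fetched = new ArrayList<>();
        for (Future<List<String>> f : fetches) {
            fetched.addAll(f.get()); // wait for each copier thread to finish
        }
        copiers.shutdown();
        System.out.println("fetched " + fetched.size() + " records: " + fetched);
    }
}
```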

Merge data and files

The fetched data is first placed in a buffer. When the buffer fills up, its contents are spilled to disk; before the spill, the data is merged once more, that is, values belonging to the same key are merged together, and the spill produces a spill file. As the data pulled back by the Reduce side keeps growing and the buffer fills up again and again, multiple spill files naturally accumulate.
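
A minimal sketch of this merge-by-key step, with invented data, is shown below: values belonging to the same key are collected together, which is the form the spill files (and, eventually, the reduce input) take:

```java
import java.util.*;

// Minimal sketch of the reduce-side merge: records fetched from the map
// side are grouped so that all values of the same key sit together, which
// is how a spill file (and eventually the reduce input) is organized.
public class ReduceMergeSketch {

    public static void main(String[] args) {
        List<Map.Entry<String, String>> fetched = List.of(
                Map.entry("apple", "1"), Map.entry("banana", "1"), Map.entry("apple", "1"));

        // TreeMap keeps the keys sorted, matching the sorted map output.
        Map<String, List<String>> merged = new TreeMap<>();
        for (Map.Entry<String, String> kv : fetched) {
            merged.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }

        // Each entry now looks like the <key, list-of-values> a reduce task sees.
        merged.forEach((key, values) -> System.out.println(key + " -> " + values));
    }
}
```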

When all the data from the Map side has been fetched, the spill files on disk are merged as well. Merging many spill files may take multiple rounds; the number of files merged in each round is controlled by the parameter io.sort.factor (default 10). For example, if 50 spill files have accumulated on disk and 10 files can be merged per round, then 5 rounds of merging are needed, producing 5 merged files.

The handful of files produced by these rounds of on-disk merging are not merged into one final large file on disk. Instead, they are merged in memory and the merged stream is fed directly into the reduce task, which reduces disk read and write overhead.

It should be noted that, across the multiple rounds of merging, the strategy for choosing how many files to merge in each round tries to make the number of files merged in the last round exactly equal to io.sort.factor. So with 40 files, we do not merge 10 files in each of four rounds to end up with 4 files. Instead, the first round merges only 4 files, and the following three rounds each merge the full 10. In the last round, the 4 merged files plus the remaining 6 unmerged files add up to exactly 10.
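
The arithmetic behind this strategy can be sketched as follows. The first-round size calculation here mirrors the pass-factor rule commonly described for Hadoop's merger, simplified for illustration, and it reproduces the 40-file example above:

```java
// Minimal sketch of the merge-round arithmetic described above: the first
// round merges fewer files so that the final round merges exactly
// `factor` files (the io.sort.factor setting, 10 by default).
public class MergePlanSketch {

    // How many files the first merge round should take, assuming the
    // "leave exactly `factor` files for the last round" strategy.
    static int firstRoundSize(int numFiles, int factor) {
        if (numFiles <= factor) {
            return numFiles; // everything fits in a single round
        }
        int mod = (numFiles - 1) % (factor - 1);
        return mod == 0 ? factor : mod + 1;
    }

    public static void main(String[] args) {
        int files = 40, factor = 10;
        int round = 1;
        while (files > factor) {
            int take = (round == 1) ? firstRoundSize(files, factor) : factor;
            files = files - take + 1; // `take` files become one merged file
            System.out.println("round " + round++ + ": merge " + take
                    + " files, " + files + " files remain");
        }
        System.out.println("final round merges the remaining " + files + " files");
    }
}
```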

Origin: blog.csdn.net/atuo200/article/details/108111506