Hadoop [2.1] Shuffle Overview

Each map task ends by producing <K, V> pairs, and on the reduce side the input arrives as <K, Iterable<V>>. The stage in between, which sorts the map task's output by key and delivers it to the right reducers, is called Shuffle. Roughly speaking, its main jobs are: 1. to pull data from the map side over to the reduce side; 2. to minimize unnecessary bandwidth consumption when pulling data across nodes; 3. to reduce the impact of disk I/O on task execution (mainly by making better use of memory rather than disk I/O).
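
To make that contract concrete, here is a minimal WordCount-style sketch using the standard org.apache.hadoop.mapreduce API (the class names are illustrative): the map emits <K, V>, and the reduce receives <K, Iterable<V>> after Shuffle has grouped the values by key.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map side: each call emits <K, V> pairs (here <word, 1>).
        public static class WordCountMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                for (String token : line.toString().split("\\s+")) {
                    if (token.isEmpty()) continue;
                    word.set(token);
                    context.write(word, ONE); // <K, V> leaves the map task
                }
            }
        }

        // Reduce side: Shuffle has already grouped the values by key,
        // so the input arrives here as <K, Iterable<V>>.
        public static class WordCountReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }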

Partially adapted from https://www.cnblogs.com/sunfie/p/4928662.html

Shuffle on the MapTask Side

Looking at the map side, Shuffle can be described in three steps: map results are written to disk; the cached data is partitioned (Partition), combined (Combiner), and sorted (Sort); and the resulting files are merged (Merge).

1. Map results are written to disk

Map tasks run in parallel and produce a large amount of output. On each node, that output is first saved into an in-memory ring buffer. When the buffer fills to 80%, a background thread starts writing its contents out to disk while the map task keeps writing into the remaining 20%; only if the buffer fills up completely does the map task block until space is freed. The files written out this way are called spill files. Because spilling is performed by a separate thread, it does not interfere with the thread writing map results into the buffer: starting a spill should not stall map output.
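
Both the buffer size and the 80% threshold are configurable. A minimal sketch of the relevant Hadoop 2.x properties set programmatically (the wrapper class is illustrative, and the values shown are the usual defaults; verify them against your distribution's mapred-default.xml):

    import org.apache.hadoop.conf.Configuration;

    public class SpillTuning {
        public static Configuration spillConf() {
            Configuration conf = new Configuration();
            // Size of the in-memory ring buffer, in MB (default: 100).
            conf.set("mapreduce.task.io.sort.mb", "100");
            // Buffer fill fraction that triggers the spill thread (default: 0.80, i.e. 80%).
            conf.set("mapreduce.map.sort.spill.percent", "0.80");
            return conf;
        }
    }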

2. Partitioning (Partition), combining (Combiner), and sorting (Sort) the cached data

First, partitioning (Partition): after the map produces its <K, V> output, we need to know which ReduceTask each pair should go to. MapReduce provides the Partitioner interface for this; its role is to decide, based on the key (or value) and the number of reduce tasks, which reduce task should ultimately process a given piece of output. The default behavior hashes the key and takes it modulo the number of reduce tasks, which only balances load across the reducers on average; if you have your own requirements, you can write a custom Partitioner and set it on the job.
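
Here is a minimal custom Partitioner sketch; the class name and the first-letter routing rule are made up purely for illustration. For comparison, the default HashPartitioner effectively computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical example: bucket words by first letter instead of by hash.
    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            String s = key.toString();
            char first = s.isEmpty() ? 'a' : Character.toLowerCase(s.charAt(0));
            return first % numReduceTasks; // char promotes to a non-negative int
        }
    }
    // Register it with: job.setPartitionerClass(FirstLetterPartitioner.class);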

Next, combining (Combiner): take WordCount as an example. If a map task produces two pairs <Hello, 1> and <Hello, 1>, shipping both of them to the ReduceTask is wasteful; it is better to combine them into a single <Hello, 2> and send that instead. A Combiner looks much like a reducer (in fact, a Combiner is written as a Reducer); the difference is that a reduce goes from <K1, V1> to <K2, V2>, whereas a combiner must go from <K, V> to <K, V> of the same types.
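
In WordCount the reducer already maps <Text, IntWritable> to <Text, IntWritable>, so it satisfies the combiner's same-types constraint and can be reused directly. A sketch of the job wiring, assuming the WordCount classes from the earlier sketch:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "wordcount");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.WordCountMapper.class);
            // Reuse the reducer as the combiner: two <Hello, 1> pairs are
            // collapsed into one <Hello, 2> on the map side, before the network.
            job.setCombinerClass(WordCount.WordCountReducer.class);
            job.setReducerClass(WordCount.WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Input/output paths omitted here; set FileInputFormat /
            // FileOutputFormat paths before running.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }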

Finally, sorting (Sort): before the spill thread writes buffered data to disk, it performs a quicksort, ordering records first by the partition they belong to, and then by key within each partition. The output of a spill consists of an index file and a data file. If a Combiner is set, it runs on this sorted output.
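
The partition-then-key ordering itself is fixed, but the key comparator can be swapped out. A hedged sketch (the descending order here is only an example, not something the post prescribes):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Example only: sort IntWritable keys in descending order within each partition.
    public class DescendingIntComparator extends WritableComparator {
        public DescendingIntComparator() {
            super(IntWritable.class, true); // true: instantiate keys for compare()
        }

        @Override
        @SuppressWarnings({"rawtypes", "unchecked"})
        public int compare(WritableComparable a, WritableComparable b) {
            return -a.compareTo(b); // negate the natural (ascending) order
        }
    }
    // Register it with: job.setSortComparatorClass(DescendingIntComparator.class);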

3. File Merge (Merge)

Every spill produces one spill file on disk, so if the map output is really large, many spills occur and multiple spill files accumulate. When the map task actually finishes, whatever data remains in the memory buffer is also spilled to disk, forming one more spill file. There will therefore always be at least one spill file on disk (if the map output is small, only a single spill file exists by the time the map finishes). Because in the end only one file can be handed over, these spill files must be merged together; this process is called Merge.
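
How many spill files get merged in one pass is governed by a merge factor. A small sketch (Hadoop 2.x property name; the value shown is the usual default, and the wrapper class is illustrative):

    import org.apache.hadoop.conf.Configuration;

    public class MergeTuning {
        public static Configuration mergeConf() {
            Configuration conf = new Configuration();
            // Number of spill files merged in a single pass (default: 10).
            // Raising it means fewer merge passes but more files open at once.
            conf.set("mapreduce.task.io.sort.factor", "10");
            return conf;
        }
    }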

Shuffle on the ReduceTask Side

The reduce side's Shuffle stage can be understood in two steps: fetching data from the map side, and merging the data.

1. Fetch data from the map side

Once a map task has merged all its output into a single file, the ReduceTask fetches that file from the MapTask's node (maps finish at different times, so fetching happens as each one completes). If the fetched data is small, it is kept directly in the JVM's memory; if it is large, it is spilled to disk, and a Combiner can also be applied during those merges.
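
The fetch phase has its own knobs. A sketch of the commonly cited Hadoop 2.x properties (the defaults shown are assumptions to verify against your cluster; the wrapper class is illustrative):

    import org.apache.hadoop.conf.Configuration;

    public class FetchTuning {
        public static Configuration fetchConf() {
            Configuration conf = new Configuration();
            // Parallel copier threads pulling map output (default: 5).
            conf.set("mapreduce.reduce.shuffle.parallelcopies", "5");
            // Fraction of reducer heap used to buffer fetched map output (default: 0.70).
            conf.set("mapreduce.reduce.shuffle.input.buffer.percent", "0.70");
            // Buffer fill fraction at which an in-memory merge/spill begins (default: 0.66).
            conf.set("mapreduce.reduce.shuffle.merge.percent", "0.66");
            return conf;
        }
    }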

2. Merge the data

The fetched data held in memory and the data in spill files are then organized into a single file. Besides merging the data fetched from different MapTasks, one final merge pass is needed to bring everything together, combining all values that share a key into one <K, Iterable<V>>. By default the result is written to a file, giving the Reducer the <K, Iterable<V>> input it needs.

 
