Analysis of the MapReduce Execution Process

A MapReduce computation is divided into two phases: the map phase and the reduce phase.
1. When the computation starts, the input data is first read through the DistributeInputStream object.

2. The data is then cut by offset into input splits, 128 MB each by default. Each split is handled by one map task. For word count, the map reads each split record by record (the input key is the record's offset within the data) and emits each word as a key with a value of 1.
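A minimal sketch of the split computation (a simplification: splits are simply cut every 128 MB here, whereas real Hadoop also considers block boundaries):

```python
SPLIT_SIZE = 128 * 1024 * 1024  # default split size: 128 MB

def input_splits(file_size, split_size=SPLIT_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes.

    Each returned split would be handed to one map task.
    """
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits
```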

3. Each map task's output is then sorted by key (quicksort); the same workflow runs on every other node, so after sorting, records with the same key sit next to each other.
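The per-map emit-and-sort step can be sketched like this (a Python stand-in: Hadoop sorts serialized key/value bytes with quicksort, here we just call sort() on (word, 1) pairs):

```python
def map_phase(lines):
    """Word-count mapper: emit (word, 1) for every word, then sort by key
    so that equal keys end up adjacent, ready for grouping in reduce."""
    pairs = []
    for line in lines:
        for word in line.split():
            pairs.append((word, 1))
    pairs.sort(key=lambda kv: kv[0])
    return pairs
```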

4. Reduce tasks start before every map task has finished: once roughly 80% of the map tasks are complete, the reducers begin fetching the already-sorted map output over HTTP. The MapReduce primitive is: all values sharing the same key form one group, and the reduce method is called once per group, iterating over that group's values to compute the result. When gathering its input, a reducer pulls the data for its keys from every map task, including map tasks on other nodes, in parallel. If some map tasks are still running, the reducer first merge-sorts whatever it has already fetched, then waits for the remaining map tasks on other nodes to finish before running the reduce computation. During that computation, values with the same key are aggregated (for word count, the counts are summed), and the finished results are written to HDFS for storage.
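The "same key, one group, one reduce call" primitive can be sketched on the sorted output (a word-count reducer that sums each group's values):

```python
from itertools import groupby

def reduce_phase(sorted_pairs):
    """Call one reduce per key: groupby relies on equal keys being
    adjacent, which the map-side sort guarantees; each group's values
    are iterated and summed, mirroring the MapReduce primitive."""
    result = {}
    for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        result[key] = sum(value for _, value in group)
    return result
```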

A reducer obtains map output files by partition number over HTTP. The map side runs an HTTP service that answers these reducer requests. The maximum number of threads for this service is set by the mapreduce.shuffle.max.threads property. The property applies per NodeManager, not per map task, because one NodeManager may be running several map tasks. Its default value of 0 means the maximum thread count is twice the number of processor cores.
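The default-thread rule can be expressed as a small helper (`shuffle_max_threads` is an illustrative function, not a Hadoop API):

```python
import os

def shuffle_max_threads(configured=0):
    """mapreduce.shuffle.max.threads semantics, per NodeManager:
    a value of 0 (the default) means twice the number of cores."""
    if configured > 0:
        return configured
    return 2 * (os.cpu_count() or 1)
```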

Map output files sit on the local disk of the node that ran the map task. A reduce task must fetch the data for its partition from every map task in the cluster, and the map tasks may finish at different times: whenever a map task completes, the reducer copies that task's data for its partition. For efficiency, this copying is multi-threaded; the default is 5 copier threads, and the number can be changed in the configuration file.
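The multi-threaded copy phase can be sketched with a thread pool (a simplification: `map_outputs` is a hypothetical `{map_id: {partition: records}}` dict standing in for the HTTP fetches, and 5 matches the default copier-thread count):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_partition(map_outputs, partition, threads=5):
    """Copy one reducer's partition from every completed map task,
    using a pool of copier threads (default 5, as in Hadoop)."""
    def fetch(map_id):
        return map_outputs[map_id].get(partition, [])

    with ThreadPoolExecutor(max_workers=threads) as pool:
        chunks = list(pool.map(fetch, sorted(map_outputs)))
    return [record for chunk in chunks for record in chunk]
```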
Spill files: while a reducer is copying a map task's data for its partition, the data first lands in a memory buffer. If the buffer fills to a threshold fraction (mapreduce.reduce.shuffle.merge.percent), or the number of buffered map outputs reaches a threshold (mapreduce.reduce.merge.inmem.threshold), the buffered data is merged and spilled to disk. If a combiner has been specified, it runs during this merge.
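In mapred-site.xml these two thresholds could be tuned like so (the values shown are the usual defaults, 0.66 and 1000):

```xml
<property>
  <name>mapreduce.reduce.shuffle.merge.percent</name>
  <value>0.66</value>
</property>
<property>
  <name>mapreduce.reduce.merge.inmem.threshold</name>
  <value>1000</value>
</property>
```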

Once the reducer has copied the data from every map task's partition, it enters the merge phase. How many merge rounds are needed depends on the merge factor (mapreduce.task.io.sort.factor, default 10): with 50 files and a factor of 10, merging groups of 10 yields 5 intermediate output files, and these 5 are not merged again into a single file.
Instead, the data is generally fed to reduce as a mix of in-memory and on-disk output.
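The 50-files example can be checked with a tiny helper (a simplification: real Hadoop's merge planning is more subtle and may merge uneven batches to minimize disk reads):

```python
import math

def intermediate_files(n_files, factor=10):
    """Number of intermediate files after one pass that merges up to
    `factor` (mapreduce.task.io.sort.factor, default 10) files each;
    50 files with factor 10 leaves 5 files fed straight to reduce."""
    return math.ceil(n_files / factor)
```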



Origin blog.csdn.net/weixin_43599377/article/details/103466267