Demonstration of two data flow models of MapReduce programming

MapReduce is a programming model for processing large-scale data sets in parallel. When MapReduce executes a computing task, the task is divided into two stages, Map and Reduce. The Map stage processes the original data, and the Reduce stage summarizes the results of the Map stage to obtain the final output. The model of these two stages is shown in Figure 1.

  Figure 1 MapReduce simple model

  The MapReduce programming model draws on the design ideas of functional programming languages; a program is implemented through the map() and reduce() functions. In terms of data format, the map() function receives a key-value pair as input and also produces key-value pairs as output. The reduce() function takes the key-value pairs output by the map() function as input, summarizes the values that share the same key, and outputs new key-value pairs. Figure 2 describes the simple data flow model of MapReduce.
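The key-value contract of map() and reduce() can be sketched in Python. This is an illustrative sketch only; the function names and the word-count example are assumptions of mine, not part of the original article:

```python
# Sketch of the map()/reduce() key-value contract, using word count
# as a hypothetical example (not from the original article).

def map_fn(key, value):
    # key: line offset (unused here); value: one line of text.
    # Emits intermediate (word, 1) key-value pairs.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word; values: all counts emitted for that word.
    # Emits one summarized (word, total) pair.
    yield (key, sum(values))

pairs = list(map_fn(0, "hello world hello"))
# pairs is [("hello", 1), ("world", 1), ("hello", 1)]
```

Note that map() emits one pair per word without any aggregation; all summarization for a given key happens in reduce().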

 Figure 2 MapReduce simple data flow model

  Regarding the MapReduce simple data flow model described in Figure 2, the details are as follows:

  (1) Parse the original data into key-value pairs.

  (2) Pass the parsed key-value pairs to the map() function. According to the mapping rules, the map() function maps each key-value pair into a series of intermediate key-value pairs.

  (3) Pass the intermediate key-value pairs to the reduce() function for processing. The values with the same key are merged together to generate new key-value pairs, which are the final output result.
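The three steps above can be sketched end to end in Python. This is a minimal simulation under my own assumptions (a word-count workload, with grouping done by an in-memory dictionary rather than a real shuffle):

```python
from collections import defaultdict

def run_simple_flow(lines):
    # Step 1: parse the raw data into (offset, line) key-value pairs.
    records = list(enumerate(lines))

    # Step 2: map each pair into intermediate (word, 1) pairs.
    intermediate = []
    for offset, line in records:
        for word in line.split():
            intermediate.append((word, 1))

    # Step 3: group values by key, then merge each group's values
    # to produce the final key-value pairs.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

result = run_simple_flow(["hello world", "hello mapreduce"])
# result is {"hello": 2, "world": 1, "mapreduce": 1}
```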

  It should be noted that for some tasks, the Reduce stage is not necessarily required. In other words, the data flow model of MapReduce may contain only a Map stage, with the data generated by Map written directly to HDFS. For most tasks, however, a Reduce stage is required, and a heavy workload may call for multiple Reduces. Figure 3 shows a MapReduce model with multiple Maps and Reduces.
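The map-only case can be sketched as follows: with no Reduce stage there is no grouping step, and each map output pair is written out directly (in this hypothetical sketch, collected into a list rather than written to HDFS files):

```python
def run_map_only(lines, map_fn):
    # No shuffle, no reduce: map outputs are written out as-is.
    output = []
    for offset, line in enumerate(lines):
        output.extend(map_fn(offset, line))
    return output

# Hypothetical map function that tags each line with its length.
def length_map(offset, line):
    yield (offset, len(line))

print(run_map_only(["ab", "cde"], length_map))  # [(0, 2), (1, 3)]
```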

 Figure 3 MapReduce model of multiple Maps and Reduces

  Figure 3 demonstrates a MapReduce program containing 3 Maps and 2 Reduces. The output pairs for related keys generated by the Maps are concentrated in a single Reduce for processing. Reduce is the final processing step; its results are not summarized a second time.
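With multiple Maps and Reduces, each intermediate key must be routed to exactly one Reduce so that all values for that key end up together; a common scheme (and Hadoop's default behavior) is to take a hash of the key modulo the number of reducers. The sketch below simulates this under my own assumptions; all function names are hypothetical:

```python
from collections import defaultdict

def partition(key, num_reduces):
    # Route a key to one reducer: hash of the key modulo the
    # number of reduce tasks (similar to Hadoop's default).
    return hash(key) % num_reduces

def run_multi(map_inputs, map_fn, reduce_fn, num_reduces=2):
    # Each element of map_inputs stands for one Map task's input split.
    per_reduce = [defaultdict(list) for _ in range(num_reduces)]
    for split in map_inputs:
        for offset, record in enumerate(split):
            for key, value in map_fn(offset, record):
                per_reduce[partition(key, num_reduces)][key].append(value)
    # Each reducer processes only its own partition. Its output is
    # final and is not merged or summarized again.
    outputs = []
    for groups in per_reduce:
        for key, values in groups.items():
            outputs.extend(reduce_fn(key, values))
    return outputs

# Hypothetical word-count map/reduce functions for the simulation.
def wc_map(offset, line):
    for word in line.split():
        yield (word, 1)

def wc_reduce(key, values):
    yield (key, sum(values))

# Three Map splits, two Reduces, mirroring Figure 3's shape.
counts = run_multi([["a b"], ["a"], ["b b"]], wc_map, wc_reduce)
```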

Origin blog.csdn.net/zy1992As/article/details/132667412