Introduction to Big Data (4): Introduction to MapReduce and Its Detailed Workflow

MapReduce is a parallel programming model for processing large-scale data sets; it can handle data sets at the TB level and above in a reliable, highly fault-tolerant way. Its two core ideas are Map (mapping) and Reduce (reduction).

 1. Overview of MapReduce workflow

A MapReduce job is a unit of work that the client wants executed: it consists of the input data, the MapReduce program, and configuration information. Hadoop divides the job into a number of tasks, which come in two types: map tasks and reduce tasks. These tasks run on the nodes of the cluster and are scheduled through YARN. If a task fails, it is automatically rescheduled on another node.

[Figure: the overall MapReduce processing flow]

In this flow, Map is the mapping step: it filters and distributes the data, converting the raw input into key-value pairs. Reduce is the merging step: it processes all values that share the same key and outputs new key-value pairs as the final result. For the reducers to process the map output in parallel, that output must first be sorted and partitioned and then handed to the corresponding reducer. This process of further sorting the map output and delivering it to the reducers is called Shuffle.

The general flow of a MapReduce job that counts word occurrences is as follows.
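As a concrete reference for this flow, here is a minimal word-count Mapper and Reducer written against the standard org.apache.hadoop.mapreduce API; the class names are illustrative rather than taken from the original post.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: one input line -> a (word, 1) pair per token.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);          // intermediate key-value pair
        }
    }
}

// Reduce: (word, [1, 1, ...]) -> (word, total count).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        total.set(sum);
        context.write(word, total);            // final key-value pair
    }
}
```

The Mapper emits one (word, 1) pair per token and the Reducer sums the ones for each word; the driver that wires these together appears in the Submit step below.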

2. The Map stage of the MapReduce workflow


Suppose the file to be processed is a 200 MB text file named ss.txt.

1. Slicing (splitting): Hadoop splits the MapReduce input data into small chunks called input splits (InputSplit). Hadoop creates one MapTask per split, and that task runs the user-defined map function over every record in the split. Pay special attention to the difference between a block and a split: a block is a physical division that stores the actual contents of the file, whereas a split is a logical division that stores only metadata about the file (the HDFS file path, the start offset of the split, and the split length), which the MapTask uses to fetch the actual file content.

The class that divides the input data into input splits (InputSplit) is InputFormat, and its main functions are as follows:

  • Divide the input data into multiple logical InputSplits, where each InputSplit becomes the input of one map task.
  • Provide a RecordReader that converts the contents of an InputSplit into the (k, v) key-value pairs fed to the map function.

[Figure: InputFormat and its subclasses]

As shown above, InputFormat is an abstract class. Hadoop uses TextInputFormat by default, and the default split size equals the block size (128 MB), so the input file ss.txt is divided into two InputSplits: 0-128 MB and 128-200 MB. Finally, Hadoop writes the split information to a split plan file.
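As a hedged sketch of how the driver can influence the splitting, the helper below (the class name and input path are illustrative) selects TextInputFormat and optionally bounds the split size; the comment notes why the 200 MB ss.txt yields two splits under the default 128 MB block size.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputConfig {
    // Assumes a Job created in the driver (see the Submit step below).
    static void configureInput(Job job) throws IOException {
        job.setInputFormatClass(TextInputFormat.class);                // the default InputFormat
        FileInputFormat.addInputPath(job, new Path("/input/ss.txt"));  // illustrative path

        // The effective split size is roughly max(minSize, min(maxSize, blockSize)),
        // so with the defaults (block size 128 MB) the 200 MB file yields two splits:
        // [0, 128 MB) and [128 MB, 200 MB).
        FileInputFormat.setMinInputSplitSize(job, 1);                   // optional
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // optional
    }
}
```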

Extended reading on InputFormat: Introduction to MapReduce InputFormat; Introduction to InputFormat Subclasses.

2. Submit: The client sends a request to the YARN cluster to create an MRAppMaster and submits the related information: the split plan (job.split), the job JAR (wc.jar, required only in cluster mode), and the job configuration (job.xml). YARN's ResourceManager creates the MRAppMaster, and the MRAppMaster creates one MapTask per split (i.e., as many MapTasks as there are InputSplits).
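For reference, a minimal driver that performs this submission, reusing the word-count classes sketched earlier; the input and output paths come from the command line, and the job name is illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);        // packaged as wc.jar in cluster mode
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // waitForCompletion() submits the job: the split plan (job.split), the job
        // configuration (job.xml), and the JAR are shipped to the cluster, and the
        // MRAppMaster then launches one MapTask per InputSplit.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```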

3. Map phase: At this point, the Map phase officially begins.

The MapTask as a whole is divided into the Read phase, Map phase, Collect phase, Spill (overflow-write) phase, and Combine phase.

  • Read phase: The MapTask first calls the createRecordReader method of the InputFormat to obtain a RecordReader, then uses that RecordReader to parse key-value pairs out of the InputSplit and passes them to the map method.
  • Map phase: The parsed key-value pairs are handed to the user-written map() function, which processes them and produces a series of new key-value pairs that are eventually written to the local disk (the map output is an intermediate result that is deleted once the job completes, so there is no need to store it in HDFS). A conceptual sketch of this read-map loop follows this list.
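The loop that ties the Read and Map phases together lives in Mapper.run(). The override below simply mirrors the default behaviour to make the flow visible; the class name and the identity key/value types are illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Conceptual view of the Read + Map phases: the Context pulls records from the
// RecordReader created by InputFormat.createRecordReader(), and each record is
// passed to map(). This override reproduces the default loop for illustration.
public class ReadLoopMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        try {
            while (context.nextKeyValue()) {            // RecordReader.nextKeyValue()
                map(context.getCurrentKey(),            // RecordReader.getCurrentKey()
                    context.getCurrentValue(),          // RecordReader.getCurrentValue()
                    context);
            }
        } finally {
            cleanup(context);
        }
    }
    // map() is not overridden here, so the inherited identity map() writes each
    // (offset, line) pair straight through to the output.
}
```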

At this point the logical processing of the Map stage has ended, and in principle this intermediate result could be handed directly to Reduce. However, to reduce communication overhead, the intermediate data is consolidated before it reaches the Reduce nodes. The data processed by one Reduce node generally comes from many Map nodes, so to ensure that the Reduce computation does not depend on data sent to other Reduce nodes, the intermediate results output by the Map nodes must be partitioned with a suitable strategy that guarantees all related data is sent to the same Reduce node. The system also applies some performance optimizations, such as running backup copies of the slowest tasks and taking the result of whichever copy finishes first (speculative execution).

Therefore, before the intermediate results generated in the Map stage are passed to Reduce, a shuffle is performed on the Map side: the data is partitioned, sorted, and buffered.
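The partitioning step is pluggable: by default Hadoop hashes the key, but a custom Partitioner can decide which partition (and hence which ReduceTask) each key goes to. The "first letter" rule below is purely illustrative and not part of the original text.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: words starting with a-m go to partition 0, everything
// else to partition 1. The default is HashPartitioner, which uses the key's hash.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString();
        if (numPartitions < 2 || word.isEmpty()) {
            return 0;
        }
        char first = Character.toLowerCase(word.charAt(0));
        return (first >= 'a' && first <= 'm') ? 0 : 1;
    }
}
// Enabled in the driver with:
//   job.setPartitionerClass(FirstLetterPartitioner.class);
//   job.setNumReduceTasks(2);
```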

  • Collect phase: When map() finishes processing a record, OutputCollector.collect() is called; it partitions the generated key-value pair (by calling the Partitioner) and writes it into a circular in-memory buffer. The circular buffer holds two kinds of content: metadata about the records and the actual record data. Its default size is 100 MB, and when it reaches 80% of capacity a reverse spill (overflow write) is triggered.
  1. Before spilling, the data in the buffer is partitioned according to the specified partitioning rule and sorted within each partition. The buffer is written in the reverse direction so that it can keep receiving new data while existing data is being spilled to disk.
  2. After partitioning and sorting, the data is spilled to disk; this may happen several times, producing multiple spill files.
  3. All files spilled to disk are then merge-sorted together.
  4. Finally, Hadoop allows the user to specify a Combiner for the map output, which locally aggregates the output of each MapTask to reduce the amount of data sent over the network. No matter how many times the Combiner runs (0 or N times), the final Reduce output must remain the same. For example, if the job computes a maximum, the Combiner can safely take the local maximum of each MapTask's output; but if the job computes an average, using the Combiner to take local averages is wrong (see the sketch after this list).

  • Spill (overflow write) phase: When the circular buffer reaches its threshold, MapReduce writes the data to the local disk as a temporary file. Note that before the data is written to disk it is sorted locally and, where necessary, combined and compressed.
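A small standalone check of the Combiner rule above: sums and maxima can be combined locally without changing the result, while local averages cannot. The class name and sample values are illustrative.

```java
// Why a Combiner must not compute local averages: a tiny standalone check.
public class CombinerAverageDemo {
    public static void main(String[] args) {
        // Imagine MapTask1 emitted values {1, 2} and MapTask2 emitted {3} for the same key.
        double avgOfLocalAverages = (((1 + 2) / 2.0) + 3) / 2.0;   // 2.25 -- wrong
        double trueAverage = (1 + 2 + 3) / 3.0;                    // 2.0  -- correct
        System.out.println("avg of local averages = " + avgOfLocalAverages);
        System.out.println("true average          = " + trueAverage);
        // For sums or maxima the two agree, which is why the word-count Reducer can
        // also be registered as the Combiner: job.setCombinerClass(WordCountReducer.class);
    }
}
```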

3. The Reduce stage of the MapReduce workflow


After all MapTasks have completed, ReduceTasks are started according to the number of partitions, and each ReduceTask is told which range of data (i.e., which partition) it should process. If there are N partitions, N ReduceTasks are started, and each ReduceTask is dedicated to the data of one partition; for example, ReduceTask1 processes partition0 of MapTask1 together with partition0 of MapTask2.
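The number of partitions, and therefore of ReduceTasks, is configured in the driver. Below is a hedged sketch matching a two-partition setup; the helper class name is illustrative, and FirstLetterPartitioner is the example partitioner from the Map-side shuffle section.

```java
import org.apache.hadoop.mapreduce.Job;

public class ReducePhaseConfig {
    // Illustrative helper, called from the driver sketched in the Submit step.
    static void configureReducers(Job job) {
        // One ReduceTask per partition: each ReduceTask fetches its own partition
        // from every MapTask's output.
        job.setNumReduceTasks(2);
        job.setPartitionerClass(FirstLetterPartitioner.class); // example partitioner above
    }
}
```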

The whole ReduceTask is divided into the Copy phase, Merge phase, Sort phase, and Reduce phase. The Copy, Merge, and Sort phases make up the shuffle operation on the Reduce side.

1. Copy phase: Each ReduceTask, according to its own partition number, copies the data of the corresponding partition from every MapTask's machine into a local in-memory buffer; if the buffer is not large enough, the data is spilled to disk.

2. Merge phase: While copying data remotely, the ReduceTask starts two background threads that merge files in memory and on disk, preventing excessive memory usage and too many files on disk.

3. Sort phase: According to MapReduce semantics, the user-written reduce() function takes as input groups of data aggregated by key. To bring data with the same key together, Hadoop uses a sort-based strategy. Since each MapTask has already partially sorted its own output, the ReduceTask only needs to perform a single merge sort over all of the data.

4. Reduce phase: Execute the reduce() function and write the final result to HDFS.

