Hadoop — Analysis of the Principles of MapReduce

1 Overview

MapReduce is a programming framework for distributed computing programs, and the core framework with which users develop "Hadoop-based data analysis applications".

The core function of MapReduce is to combine the business logic code written by the user with its built-in default components into a complete distributed computing program that runs concurrently on a Hadoop cluster.

1.1 The birth background of MapReduce

Background reasons:
(1) A single machine cannot process massive amounts of data because of hardware resource limits;
(2) Extending a single-machine program to run distributed on a cluster greatly increases the complexity of the program and the difficulty of development;
(3) With the MapReduce framework, developers can concentrate most of their work on the business logic and leave the complexity of distributed computing to the framework.

It can be seen that expanding a program from a stand-alone version to a distributed one introduces a lot of complex work. To improve development efficiency, the functions common to distributed programs can be encapsulated in a framework, letting developers concentrate on business logic. MapReduce is exactly such a general-purpose framework for distributed programs.

2. MapReduce framework and core operating mechanism

2.1 Framework Architecture

A complete MapReduce program has three types of instance processes during distributed operation:
1. MRAppMaster (MapReduce application master): responsible for process scheduling and state coordination of the entire program
2. MapTask: responsible for the entire data processing flow of the map stage
3. ReduceTask: responsible for the entire data processing flow of the reduce stage

2.2 MapReduce program running process

(1) When an MR program starts, the MRAppMaster is started first. After starting, the MRAppMaster calculates the required number of map task instances from the description information of the job, and then applies to the cluster for machines to start the corresponding number of map task processes;

(2) After a map task process starts, it processes the data in its assigned slice range. The main steps are:
    a) use the InputFormat specified by the client to obtain a RecordReader, which reads the data and forms the input KV pairs;
    b) pass each input KV pair to the map() method defined by the client for the logical operation, and collect the KV pairs output by map() into a cache;
    c) sort the KV pairs in the cache by partition and by key K, then continuously spill them to disk files;

(3) After MRAppMaster observes that all map task processes have completed, it starts the number of reduce task processes specified by the client's parameters, and tells each reduce task process the range of data (data partition) it is to handle;

(4) After a reduce task process starts, it uses the data locations announced by MRAppMaster to fetch a number of map task output result files from the machines where those map tasks ran, and merge-sorts them locally. It then groups KV pairs with the same key, calls the client-defined reduce() method on each group for the logical operation, collects the output result KV pairs, and finally calls the OutputFormat specified by the client to write the result data to external storage.

2.2.1 Analysis of the running process of the example wordcount of MapReduce

(Figure: running process of the wordcount example)

Description:

① The client obtains information about the data to be processed (such as how many files there are and how large they are), and then forms a task allocation plan according to the parameter configuration;

② The client submits the related files, such as job.split, wc.jar and job.xml, to Yarn, and applies for resources;

③ Yarn first starts the MRAppMaster;

④ The MRAppMaster calculates the required number of map task instances from the job description information provided by Yarn, then applies to the cluster for machines and starts the corresponding map task processes. Here, three map tasks are started;

⑤ Each map task reads data from its assigned slice range (such as "a.txt 0-128") according to the task allocation plan, using the InputFormat to form the input (K, V) key-value pairs;

⑥ The map() method of wordcountMapper then performs the logical operation on each input (K, V) key-value pair;

⑦ The (K, V) key-value pairs output by the map() method are cached in the outputCollector;

⑧ The (K, V) key-value pairs output by the map task go through partitioning, sorting and other operations in the shuffle process, forming new (K, V set) key-value pairs;

⑨ Each reduce task performs the logical processing on its (K, V set) key-value pairs;

⑩ Each reduce task outputs its result (K, V) key-value pairs to the outputFormat;

⑪ The outputFormat writes the (K, V) key-value pairs passed to it into text files such as part-r-0000*;
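For reference, below is a minimal sketch of the wordcountMapper and a matching reducer as they are commonly written; the class names and the whitespace-splitting rule are illustrative, not taken from the article's own code:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: called once per line; the input key is the line's byte offset in the file.
class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text word = new Text();
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String w : value.toString().split("\\s+")) { // split the line into words
            if (!w.isEmpty()) {
                word.set(w);
                context.write(word, ONE); // collected into the outputCollector (step ⑦)
            }
        }
    }
}

// Reduce stage: called once per key, with all values of that key grouped together (step ⑨).
class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result); // handed to the outputFormat (step ⑩)
    }
}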

3. The process by which the client submits the MR program job to Yarn

This process corresponds to steps ① and ② of "2.2.1 Analysis of the running process of the example wordcount of MapReduce".

(Figure: the flow of submitting an MR job from the client to Yarn)

The job.waitForCompletion() method calls job.submit(), which uses a JobSubmitter instance. JobSubmitter holds a Cluster member, and the Cluster in turn holds a proxy member that stands in for either YarnRunner or LocalJobRunner: YarnRunner submits the job resource path to Yarn, while LocalJobRunner submits it to the local machine. The flow is otherwise the same for both, and each produces a StagingDir and a JobID. JobSubmitter also calls FileInputFormat.getSplits() to obtain the slice plan List<FileSplit> and serializes it into the job.split file, creates job.xml from the job-related parameters set by the client, and finally obtains the job's jar package. The StagingDir, JobID, job.split, job.xml and the jar package are combined into the job resource submission path.
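As a point of reference, a typical driver that triggers this submission flow looks roughly as follows; the input/output paths are illustrative, and the Mapper and Reducer classes are the ones sketched in section 2.2.1:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

class WordcountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(WordcountDriver.class);      // identifies the jar (e.g. wc.jar) to ship
        job.setMapperClass(WordcountMapper.class);
        job.setReducerClass(WordcountReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));  // input drives the slice plan (job.split)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion() calls submit(), which runs the JobSubmitter flow described above
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}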

4. Map Task parallelism decision mechanism

The parallelism of the map tasks determines the concurrency of the map phase, which in turn affects the processing speed of the whole job. The degree of map-phase parallelism of a job is decided by the client when the job is submitted. The client's basic planning logic is: logically slice the data to be processed (that is, divide it into multiple logical splits according to a specific split size), and then assign one parallel mapTask instance to each split. This logic, and the resulting slice plan description file job.split, are produced by the getSplits() method of FileInputFormat. This task slicing step, which determines the map-phase parallelism, is step ① in "2.2.1 Analysis of the running process of the example wordcount of MapReduce". The slice planning process is as follows:

Slicing is defined in the getSplits() method of the FileInputFormat class.

The default slicing mechanism in FileInputFormat:
a) slices simply according to the content length of the file;
b) the slice size equals the block size by default;
c) the whole dataset is not considered when slicing; each file is sliced individually, one by one.

Example: with a 128 MB block size, a 320 MB file a.txt is planned as three slices — a.txt 0-128M, a.txt 128-256M and a.txt 256-320M — each handed to one mapTask, which matches the three map tasks in section 2.2.1.

Parameter configuration of the size of slices in FileInputFormat

From the source code, the logic in FileInputFormat for computing the slice size is: Math.max(minSize, Math.min(maxSize, blockSize)); the slice size is therefore determined by these values:

minSize — default: 1; configuration parameter: mapreduce.input.fileinputformat.split.minsize

maxSize — default: Long.MAX_VALUE; configuration parameter: mapreduce.input.fileinputformat.split.maxsize

blockSize — the HDFS block size, which is the default slice size

By default, slice size = blockSize.

maxSize (the maximum slice size): if this parameter is set smaller than blockSize, the slices become smaller, equal to the configured value;

minSize (the minimum slice size): if this parameter is set larger than blockSize, the slices become larger than blockSize.
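Both parameters can be set in the driver through the static helpers on FileInputFormat; a small sketch, with the sizes chosen purely for illustration:

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// In the driver, after the Job has been created.
// Either: cap the slice size below the block size (smaller slices, more map tasks):
FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);   // sets ...split.maxsize to 64 MB
// Or: raise the minimum above the block size (larger slices, fewer map tasks):
FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);  // sets ...split.minsize to 256 MB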

Lessons learned from map parallelism:

Factors influencing the choice of the degree of concurrency: ① the hardware configuration of the compute nodes; ② the type of computing task: CPU-intensive or IO-intensive; ③ the data volume of the computing task.

If the hardware configuration is 2 × 12-core CPUs with 64 GB of RAM, an appropriate map parallelism is roughly 20-100 maps per node, and each map should preferably run for at least one minute.

If each map or reduce task of a job runs for only 30-40 seconds, reduce the number of maps or reduces of the job. The setup of each task (map or reduce) has to be added to the scheduler for scheduling, and this intermediate step can take several seconds, so if every task finishes very quickly, too much time is wasted at the start and end of tasks. Configuring JVM reuse for tasks can mitigate this problem: mapred.job.reuse.jvm.num.tasks (default 1) is the maximum number of tasks of the same job that can run sequentially in one JVM; by default, each task starts its own JVM.

If the input files are very large, for example 1 TB, consider setting a larger block size on HDFS, such as 256 MB or 512 MB.

5. Reduce Task parallelism decision mechanism

The parallelism of the reduce tasks also affects the execution concurrency and efficiency of the whole job. However, unlike map task concurrency, which is determined by the number of slices, the number of reduce tasks can be set directly by hand:

// The default value is 1; here it is manually set to 4
job.setNumReduceTasks(4);

If the data distribution is not uniform, data skew may occur in the reduce phase.

Note: the number of reduce tasks cannot be set arbitrarily; business logic requirements must be considered. In some cases there can be only one reduce task, for example when a global summary result must be computed.

Try not to run too many reduce tasks. For most jobs, it is best for the number of reduces to be equal to, or slightly less than, the number of reduce slots in the cluster. This matters especially for small clusters.

6. The shuffle mechanism of MapReduce

6.1 Overview

In MapReduce, how the data processed in the map stage is passed to the reduce stage is the most critical step in the framework. This step is called shuffle.
Shuffle, as in shuffling and dealing cards, has as its core mechanisms data partitioning, sorting and caching.
Concretely: the result data output by the map tasks is distributed to the reduce tasks, and in the course of that distribution the data is partitioned and sorted by key.

6.2 Shuffle process

Shuffle is one stage of the MR processing flow; each of its steps is completed on the individual map task and reduce task nodes. Overall it consists of three operations:
(1) Partition: partitioning the data
(2) Sort: sorting by key
(3) Combiner: locally combining values

1. ①②③④ The map task reads a file line by line through TextInputFormat (--> RecordReader --> read()), which returns (key, value) pairs;

2. ⑤ Each (key, value) pair from the previous step is logically processed by the Mapper's map() method to form a new (k, v) pair, which is output via context.write to the OutputCollector;

3. ⑥ The OutputCollector writes the collected (k, v) pairs into the ring buffer. The ring buffer is 100 MB by default, and only 80% of it is filled (the ring buffer is in fact an array: data is written at the front while a background component cleans up behind, writing data out to a file to prevent overflow). A spill is triggered when the data in the ring buffer reaches 80% of its size;

4. ⑦ Before each spill the data is partitioned and sorted: a partition value is computed for each (k, v) pair in the ring buffer by hashing, and pairs with the same partition value belong to the same partition (see the partitioner sketch after this list). The data in the buffer is then sorted in ascending order, first by partition value and, within the same partition, by key;

5. ⑧ The sorted in-memory data of the ring buffer is continuously spilled to local disk files; if the map stage processes a large amount of data, many files may be spilled;

6. ⑨ The multiple spill files are merged by merge sort into one large file, so the final result file of the map task is still partitioned, and ordered within each partition;

7. ⑩ Each reduce task, according to its own partition number, copies the data of that partition from every map task node to the reduce task's local disk working directory;

8. ⑪ The reduce task merges the result files of the same partition coming from different map tasks into one large file (merge sort); the content of the large file is ordered by k;

9. ⑫⑬ Once the large file is merged, the shuffle process ends and the logical operation of the reduce task begins. The GroupingComparator is first called to group the data in the large file; one (k, values) group is taken from the file at a time, and the user-defined reduce() method is called on it for the logical processing;

10. ⑭⑮ Finally, the result data is written through the OutputFormat to the part-r-000** result files.
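The partition value in step 4 is produced by the job's Partitioner; by default this is HashPartitioner, which takes the key's hash modulo the number of reduce tasks. A custom partitioner is a small class like the following sketch, where the first-letter routing rule is only an illustration:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative rule: words starting with a-m go to partition 0, everything else to partition 1.
class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString().toLowerCase();
        return (!s.isEmpty() && s.charAt(0) >= 'a' && s.charAt(0) <= 'm') ? 0 : 1;
    }
}

// Registered in the driver, together with a matching number of reduce tasks:
// job.setPartitionerClass(FirstLetterPartitioner.class);
// job.setNumReduceTasks(2);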

Note: the size of the ring buffer in shuffle affects the execution efficiency of the MapReduce program: in principle, the larger the buffer, the fewer the disk IOs and the faster the execution. The ring buffer size can be changed by setting mapreduce.task.io.sort.mb in mapred-site.xml; the default is 100 MB. When a spill occurs, the combiner component, if configured, is called; its logic is the same as that of reduce: values with the same key are merged (added together). This raises transmission efficiency and saves a lot of network bandwidth and local disk IO. The concrete implementation is to define a combiner class that extends Reducer, whose input types are the same as the map output types.
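Following that note, a wordcount combiner is simply a Reducer whose input types match the map output types; a minimal sketch:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Pre-aggregates (word, 1) pairs on the map side, before the data crosses the network.
class WordcountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get(); // same key: add the values together
        }
        context.write(key, new IntWritable(sum));
    }
}

// Registered in the driver:
// job.setCombinerClass(WordcountCombiner.class);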

Regarding optimization strategies for a large number of small files:

(1) By default, the slicing mechanism of TextInputFormat plans slices per file: no matter how small a file is, it becomes a separate slice handed to its own maptask. With a large number of small files this produces a large number of maptasks, and processing efficiency is extremely low;

(2) Optimization strategy: the best method is to merge small files into large files at the front end of the data processing system (during preprocessing/collection) before uploading them to HDFS for later analysis. As a remedy, if a large number of small files already exist in HDFS, another InputFormat, CombineFileInputFormat, can be used for slicing. Its slicing logic differs from FileInputFormat's: it can logically plan multiple small files into one slice, so that multiple small files are handed to a single maptask for processing.
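In the mapreduce API the concrete text subclass is CombineTextInputFormat; a driver-side sketch, with the 4 MB ceiling chosen only as an example:

import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

// In the driver: pack many small files into few slices, each at most ~4 MB here.
job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMaxInputSplitSize(job, 4L * 1024 * 1024);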

7. Serialization in MapReduce

7.1 Overview

Java's own serialization (Serializable) is a heavyweight serialization framework: a serialized object carries a lot of extra information (various check information, header, inheritance hierarchy, etc.), which makes efficient transmission over the network inconvenient. Hadoop therefore developed its own serialization mechanism (Writable), which is lean and efficient.

7.2 Custom objects implement serialization interface in MR

If a custom bean needs to be transferred as the key, it must also implement the Comparable interface, because the MapReduce shuffle process must sort the keys. In that case the custom bean should implement: public class FlowBean implements WritableComparable<FlowBean>
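A sketch of such a bean follows; the upFlow/downFlow fields and the descending-by-total ordering are illustrative, not part of a fixed API:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

class FlowBean implements WritableComparable<FlowBean> {
    private long upFlow;
    private long downFlow;

    public FlowBean() { }  // no-arg constructor required: the framework instantiates it by reflection

    public void set(long upFlow, long downFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
    }

    @Override
    public void write(DataOutput out) throws IOException {     // serialization
        out.writeLong(upFlow);
        out.writeLong(downFlow);
    }

    @Override
    public void readFields(DataInput in) throws IOException {  // deserialization: same field order as write()
        upFlow = in.readLong();
        downFlow = in.readLong();
    }

    @Override
    public int compareTo(FlowBean o) {  // used by shuffle to sort keys; here, descending by total flow
        return Long.compare(o.upFlow + o.downFlow, upFlow + downFlow);
    }

    @Override
    public String toString() {          // what TextOutputFormat writes for this object
        return upFlow + "\t" + downFlow;
    }
}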

