Principles and Applications of Big Data Technology, Part III: Big Data Processing and Analysis (1) MapReduce

Table of contents

Chapter 7, MapReduce

1. Overview of MapReduce

1.1 Differences between MapReduce and traditional parallel computing frameworks

1.2 Idea of MapReduce

1.3 Map and Reduce functions

2. The workflow of MapReduce

2.1 Overview

2.2 Shuffle process

3. MapReduce programming


Chapter 7, MapReduce

Moore's Law: CPU performance doubles approximately every 18 months; the law has been gradually breaking down since about 2005.

Distributed parallel programming: once people could no longer rely on single-CPU performance improvements, they turned to distributed parallel programming to improve program performance. Distributed parallel programs run on large-scale computer clusters and make full use of the clusters' parallel processing capabilities.

1. Overview of MapReduce

1.1 Differences between MapReduce and traditional parallel computing frameworks

| | Traditional parallel computing frameworks | MapReduce |
| --- | --- | --- |
| Cluster architecture / fault tolerance | HPC (high-performance computing clusters); shared architecture (shared memory / shared storage); poor fault tolerance | Shared-nothing architecture; good fault tolerance |
| Hardware / price / scalability | Blade servers, high-speed networks, SAN; expensive; poor scalability | Ordinary commodity PCs; cheap; good scalability |
| Programming / learning difficulty | Must specify both "what" and "how": difficult | Only specify "what": simple |
| Applicable scenarios | Real-time, fine-grained, compute-intensive computation | Batch, non-real-time, data-intensive processing |

1.2 Idea of MapReduce

MapReduce highly abstracts the complex parallel computing process running on large-scale clusters into two functions: Map and Reduce

Design idea: "computation moves to the data" rather than "data moves to the computation", because moving data incurs heavy network transmission overhead, while moving computation is cheaper and safer.

MapReduce model: a large-scale data set stored in a distributed file system is divided into many independent small data sets (splits), which are processed in parallel by multiple Map tasks. The MapReduce framework feeds one split to each Map task; the results of the Map tasks then serve as the input of the Reduce tasks, and the Reduce tasks finally write their output back into the distributed file system.

The premise for applying MapReduce: the data set to be processed can be decomposed into many small data sets, each of which can be processed completely in parallel.

1.3 Map and Reduce functions

Both functions take key-value pairs as input and, according to certain mapping rules, produce key-value pairs as output.

| Function | Input | Output | Description |
| --- | --- | --- | --- |
| Map | <k1, v1>, e.g. <line number, "abc">. The input file blocks come from the distributed file system; their format is arbitrary, but a single element is never stored across file blocks. [Keys are not unique] | List(<k2, v2>), e.g. <"a", 1>, <"b", 1>. [The same input element can generate multiple key-value pairs with the same key] | 1. Each small data set is further parsed into a batch of <key, value> pairs, which are fed into the Map function for processing. 2. Each input <k1, v1> produces a batch of <k2, v2>; these <k2, v2> pairs are the intermediate results of the computation. |
| Reduce | <k2, List(v2)>, e.g. <"a", <1, 1, 1>>. List(v2) is the batch of values that all belong to the same k2. | <k3, v3>, e.g. <"a", 3> | Reduce combines the input values that share the same key according to user-defined logic and produces the output; the output results are merged into one file. |
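As a minimal illustration of how these two function signatures compose, the word-count example from the table can be traced end to end in plain Java, independent of Hadoop (the class and method names below are purely for illustration):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MapReduceIdea {
    // Map: <line number, line text> -> list of <word, 1>
    static List<Map.Entry<String, Integer>> map(long lineNo, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    // Reduce: <word, list of counts> -> <word, total>
    static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return Map.entry(word, sum);
    }

    public static void main(String[] args) {
        System.out.println(map(1, "a b a"));               // [a=1, b=1, a=1]
        System.out.println(reduce("a", List.of(1, 1, 1)));  // a=3
    }
}

The Shuffle step described in section 2 is what turns the Map output <"a",1>, <"a",1>, <"a",1> into the grouped Reduce input <"a", <1,1,1>>.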

2. The workflow of MapReduce

2.1 Overview

1. The input is checked and split [logical splitting]. Each InputSplit produced by the splitting is processed by a RecordReader, which loads the data and converts it into key-value pairs suitable for the Map task to read.

2. A large MapReduce job is split into many Map tasks. The Map tasks run in parallel and do not transfer data to each other [no extra data-transfer overhead], and each Map task runs on the data node that stores its corresponding data [computation moves to the data; the input file is local].

3. When a Map task ends, it has generated multiple intermediate results in the form of key-value pairs according to the user-defined mapping rules [intermediate results are saved locally].

4. When the Map tasks finish, the intermediate results go through operations such as partitioning, sorting, combining, and merging, yielding key-value pairs in the form <key, <value1, value2, ...>>. This process of turning unordered <key, value> pairs into ordered <key, value-list> pairs is called Shuffle [every Map task performs Shuffle].

5. The intermediate results after Shuffle are distributed to multiple Reduce tasks [handled by the MapReduce framework itself], which run on multiple machines [key-value pairs with the same key are sent to the same Reduce task]; the Reduce tasks run in parallel with one another.

6. Each Reduce task summarizes the intermediate results according to the user-defined logic and writes the final results to the corresponding directory of the distributed file system.

P.S.

Split: HDFS stores data in fixed-size blocks as its basic unit, while the processing unit of MapReduce is the split. A split is a logical concept that contains only metadata, such as the starting position of the data, its length, and the nodes where it resides. How splits are drawn is entirely up to the user.

Number of Map tasks: Hadoop creates one Map task for each split, so the number of splits determines the number of Map tasks. In most cases, the ideal split size is one HDFS block.

Number of Reduce tasks: the optimal number of Reduce tasks depends on the number of reduce task slots available in the cluster. Usually the number of Reduce tasks is set slightly smaller than the number of reduce slots, which reserves some system resources for handling possible errors.
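A minimal sketch of how these two knobs appear in the Hadoop Java API, assuming a driver like the WcRunner class in section 3 (the helper class below and the concrete sizes and counts are illustrative, not recommendations):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class TaskCountTuning {
    //Illustrative helper: bound the split size (and hence the number of Map tasks)
    //and set the number of Reduce tasks explicitly
    public static void tune(Job job) {
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // at least 64 MB per split
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // at most 128 MB per split (about one HDFS block)
        job.setNumReduceTasks(4);  // slightly fewer Reduce tasks than available reduce slots
    }
}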

2.2 Shuffle process

Shuffle: the process of partitioning, sorting, combining, and merging the output of the Map tasks and handing it over to Reduce.

The process is divided into operations on the Map side and operations on the Reduce side.

Shuffle process on the Map side

1. Data input: the input data is read and multiple key-value pairs are generated according to the mapping rules.

2. Execute the Map task: each Map task is allocated a cache (buffer), and the key-value pairs output by Map are written into this cache. Once a certain amount has accumulated, they are written out in one batch [this helps reduce addressing overhead and makes I/O cheaper].

3. Spill (partition, sort, and combine): when the data in the cache is close to filling the available memory, a spill (overflow write) is started [in a separate background thread]:

Partitioning: the key-value pairs in the cache are partitioned [by default] by hashing the key and taking it modulo the number of Reduce tasks [hash(key) mod R]; this determines which Reduce task each result is assigned to (see the sketch after this list).

Sorting: the data in memory is sorted by key [by default].

Combiner operation [optional]: values with the same key can be summed to obtain <key, value1+value2> [combining must not change the final result]. When spill files are merged, the Combiner is run again only if the number of spill files exceeds a predetermined value (default 3); with fewer files it is not needed (see the sketch after this list).

4. Write to disk: finally the data is written to disk and the cache is cleared. Each spill generates a new spill file.

5. File merging: when the Map task ends, all of its spill files are merged into one large, partitioned file.

6. Notify Reduce: the JobTracker constantly monitors the execution of the Map tasks and notifies the Reduce tasks to come and fetch the data.
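A sketch of how the default partitioning rule and the optional Combiner show up in code. The WcPartitioner class name is made up for illustration; Hadoop's built-in HashPartitioner already implements this hash(key) mod R rule, so you would not normally write it yourself. Wiring the Combiner into the driver is an assumption based on the word-count classes in section 3:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

//Essentially what the default HashPartitioner does: hash the key, take it modulo R
public class WcPartitioner extends Partitioner<Text, LongWritable> {
    @Override
    public int getPartition(Text key, LongWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

//In a driver such as WcRunner (section 3), the optional Combiner can be wired in; for word count
//the Reducer itself can serve as the Combiner, because summing partial counts does not change the result:
//    job.setCombinerClass(WcReduce.class);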

The Shuffle process on the Reduce side

1. Receive data: the Reduce task fetches the intermediate results stored on the local disks of the Map machines and puts them on the local disk of its own machine. Even before the Reduce computation starts, the task keeps receiving the data that belongs to the partition it is responsible for.

2. Merge data: the data received from the Map side is first placed in the Reduce task's cache, and spills are performed in the same way as on the Map side. During a spill, data with the same key is combined and then sorted; a user-defined Combiner can also be applied here. When all the data has been received, the spill files are merged into large files. [The number of spill files merged in each round is a configured value, 10 by default; so 50 spill files need 5 rounds of merging, producing 5 large files.]

3. Data input to Reduce: the resulting large files are not merged any further; they are fed directly into the Reduce task, which reduces disk read/write overhead.
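The buffer size, the spill trigger, and the merge factor mentioned above correspond to Hadoop configuration properties. A sketch of setting them in a driver such as WcRunner (section 3), before Job.getInstance(conf); the property names are those used by Hadoop 2.x/3.x, and the values shown are the usual defaults, given only for illustration:

//Illustrative Shuffle-related tuning in the driver
Configuration conf = new Configuration();
conf.set("mapreduce.task.io.sort.mb", "100");          // size of the Map-side output buffer, in MB
conf.set("mapreduce.map.sort.spill.percent", "0.80");  // buffer fill ratio that triggers a spill to disk
conf.set("mapreduce.task.io.sort.factor", "10");       // number of spill files merged in one round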

3. MapReduce programming

A MapReduce program is divided into three parts, for which the following three classes are created: a Mapper class (WcMap), a Reducer class (WcReduce), and a driver class (WcRunner) that configures and submits the job.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WcMap extends Mapper<LongWritable, Text, Text, LongWritable> {
    //The MapReduce framework calls map() once for each line of input data:
    //key is the starting byte offset of the line, value is the text of the line
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        //Word-count logic: split the line into words and emit <word, 1> for each word
        StringTokenizer words = new StringTokenizer(value.toString());
        while (words.hasMoreTokens()) {
            context.write(new Text(words.nextToken()), new LongWritable(1));
        }
    }
}

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WcReduce extends Reducer<Text, LongWritable, Text, LongWritable> {

    //After the map phase, the framework groups all key-value pairs by key and
    //calls reduce() once per group, passing <key, value-list>
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        //Word-count logic: sum the counts for this word and emit <word, total>
        long sum = 0;
        for (LongWritable count : values) {
            sum += count.get();
        }
        context.write(key, new LongWritable(sum));
    }
}

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WcRunner {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        //Create the configuration
        Configuration conf = new Configuration();
        //Get a job
        Job job = Job.getInstance(conf);

        //Set which jar contains the classes used by this job
        job.setJarByClass(WcRunner.class);

        //The Mapper and Reducer classes used by this job
        job.setMapperClass(WcMap.class);
        job.setReducerClass(WcReduce.class);

        //Specify the key-value types of the reduce output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        //Specify the key-value types of the map output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        //Specify the path of the input data to be processed
        FileInputFormat.setInputPaths(job, new Path("hdfs://master:9000/user/cg/input"));

        //Specify the path where the results will be stored
        FileOutputFormat.setOutputPath(job, new Path("hdfs://master:9000/user/cg/output"));

        //Submit the job to the cluster and wait for it to finish
        job.waitForCompletion(true);
    }
}
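After packaging these three classes into a jar, the job is typically submitted to the cluster with the hadoop jar command (for example, hadoop jar wc.jar WcRunner, where the jar name here is only illustrative), and the results can then be read from the output directory on HDFS.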

Origin blog.csdn.net/CNDefoliation/article/details/128009932