[Hadoop series] (2) The principle and use of MapReduce

MapReduce

1. MapReduce concept

MapReduce is a programming framework for distributed computing. Its core function is to combine the business logic code written by users with its own built-in components into a complete distributed computing program that runs concurrently on a Hadoop cluster.

MapReduce is easy to program, scales well, and is suitable for processing petabyte-level data; however, it is not suitable for real-time computation, stream computation, or DAG (directed graph) computation.

2. MapReduce design concept

The MapReduce model is mainly divided into the following stages: Input, Split, Map, Shuffle, and Reduce.

Input: read the data; the InputFormat splits the file into multiple InputSplits, and a RecordReader converts each InputSplit into standard <key, value> pairs that serve as the input of map;

Split: in this step the data is coarsely divided, by rows, to obtain <Key, Value> data;

Map: fine-grained processing that produces <Key, List> data; the map output is sorted and partitioned in a ring buffer, and when the buffer fills up the data is spilled to disk. The buffer size largely determines the performance of the MR task; the default size is 100 MB. A Combine task can be configured in this process to perform an initial aggregation of records with the same key (by default, once there are at least 3 spill files, the combiner runs again during the merge);

Combine: preliminary aggregation during the merge, based mainly on the partition number and identical keys; the data within the same partition is kept in sorted order;

Shuffle: shuffling, that is, delivering the partitioned output of each MapTask to the corresponding Reduce tasks. The data transfer in this step is a copy process that involves network I/O, which makes it both time-consuming and the core step of the framework.

Reduce: merge the copied data fragments; merge sorting is involved in this step.

Partition: custom output partitioning is supported; the default partitioner is HashPartitioner, which uses the formula (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
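
To illustrate, here is a minimal sketch of a custom partitioner (the class name WordPartitioner and the Text/IntWritable key-value types are assumptions for this example); it simply reproduces the default HashPartitioner formula and would be registered in the driver with job.setPartitionerClass(WordPartitioner.class) together with job.setNumReduceTasks(n):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Minimal sketch of a custom partitioner: it applies the same formula as the
// default HashPartitioner, so records with the same key land in the same partition.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
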
Let's try to explain the specific running process of MapReduce:

  1. Before the client submits the job, the InputFormat divides the input data into splits according to the configured strategy (the split size defaults to the block size, 128 MB; see the split-size sketch after this list), and each split is handed to one MapTask (submission is handled by YARN);
  2. Each MapTask runs the map function to generate <K, V> pairs, writes the results to the ring buffer, and then partitions, sorts, and spills them to disk;
  3. Shuffle: the map output is divided into multiple partitions and distributed to multiple Reduce tasks; this process is called Shuffle;
  4. Reduce: each ReduceTask copies its partition of the map output (the fetch step, with 5 copy threads by default) and, once all copies are complete, performs the merge and runs the reduce function.
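
As a side note on step 1, in Hadoop 2.x FileInputFormat essentially chooses the split size as sketched below (minSize and maxSize come from the mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize settings):

// Simplified sketch of how FileInputFormat computes the split size.
// With default settings, splitSize ends up equal to the HDFS block size (128 MB).
long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));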

The distributed nature of MapReduce means that a single Job runs many MapTasks, Shuffle transfers, and ReduceTasks in parallel.

3. Writing a MapReduce program

MR programming framework:

1) Mapper stage

(1) A user-defined Mapper must extend the Mapper parent class provided by the framework

(2) The input data of Mapper is in the form of KV pairs (the type of KV can be customized)

(3) The business logic in Mapper is written in the map() method

(4) The output data of Mapper is in the form of KV pairs (the type of KV can be customized)

(5) The map() method is called once by the MapTask process for each <K, V> pair

2) Reducer stage

(1) A user-defined Reducer must extend the Reducer parent class provided by the framework

(2) The input data type of Reducer corresponds to the output data type of Mapper, which is also KV

(3) The business logic of the Reducer is written in the reduce() method

(4) The ReduceTask process calls the reduce() method once for each group of <k,v> pairs that share the same k

3) Driver stage

The entire program needs a Driver to submit it; what gets submitted is a Job object that describes all of the necessary information.

4. Classic MapReduce word frequency statistics case

Now write the first MapReduce program to implement the WordCount case:

Environment preparation:

Create a new Maven project in IDEA, add the Hadoop core dependencies for the corresponding version, and place the configuration files under the resources directory:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.2</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.2</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.7.2</version>
</dependency>

1. Write a map program

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {

    Text k = new Text();
    IntWritable v = new IntWritable(1);

    // Override the map method to implement the business logic
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. Get one line
        String line = value.toString();
        // 2. Split the line into words
        String[] words = line.split(" ");
        for (String word : words) {
            k.set(word);
            context.write(k, v);
        }
    }
}
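
For an input line such as "hello world hello", this map() emits <hello,1>, <world,1>, and <hello,1>; the framework then sorts and groups these pairs by key before they reach the reducer.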

2. Write the reduce program

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // A subclass constructor implicitly contains a default super() call that invokes the parent-class constructor
        //super.reduce(key, values, context);
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
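
Continuing that example, after the shuffle this reducer receives <hello, [1,1]> and <world, [1]> and writes <hello, 2> and <world, 1> to the output.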

3. Write the driver class

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // 1 Get the job
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        // 2 Set the jar load path
        job.setJarByClass(WordCountDriver.class);
        // 3 Set the mapper and reducer classes
        job.setMapperClass(WordCountMap.class);
        job.setReducerClass(WordCountReduce.class);
        // 4 Set the map output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // 5 Set the reduce (final) output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 6 Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // 7 Submit
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
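
As an optional tweak tied to the Combine step from section 2 (not part of the original driver above), the reducer can also be reused as a combiner, since summing counts is associative; this pre-aggregates map output on each MapTask and reduces shuffle traffic:

// Optional: reuse the reducer as a combiner to pre-aggregate map output locally.
job.setCombinerClass(WordCountReduce.class);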

Then use the IDE to package the program into a jar:

In Project Structure -> Artifacts, choose From modules with dependencies, select your own main class, and set the jar output directory; then, from the Build menu, choose Build Artifacts -> Build to produce the jar package. Finally, upload the jar package to the cluster with an FTP tool and run it with the following command.

hadoop jar <jar package name> <fully qualified main class name> <input directory> <output directory>
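
For example (the jar name, package name, and paths below are placeholders for illustration):

hadoop jar wordcount.jar com.example.WordCountDriver /input /output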


5. Development tips

So how do you run the MapReduce program during development?

In real-world development, the program should be tested locally before it is packaged into a jar and released to the cluster.

First, set the main class in the IDEA run configuration and add the input and output paths as program arguments in advance.


Then change the configuration so that the job uses the local file system and runs in local mode.

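A minimal sketch of what that local-mode setup amounts to in code (assuming the settings are applied to the driver's Configuration before the Job is created; both keys are standard Hadoop configuration properties):

// In WordCountDriver, before Job.getInstance(configuration):
configuration.set("mapreduce.framework.name", "local");  // use the local job runner instead of YARN
configuration.set("fs.defaultFS", "file:///");           // use the local file system instead of HDFS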

Reference link

Origin blog.csdn.net/qq_40589204/article/details/118160989