Nine, MapReduce, the core component of Hadoop

       In the previous articles we focused on HDFS. Starting with this article, we introduce MapReduce. Follow the column "Broken Cocoon into Butterfly - Hadoop" to see the related series of articles~


Table of Contents

One, the definition of MapReduce

Two, the advantages and disadvantages of MapReduce

2.1 Advantages

2.2 Disadvantages

Three, the core idea of MapReduce

Four, the MapReduce process

Five, encoding to implement MapReduce's WordCount

5.1 Serialization types

5.2 Programming Specifications

5.3 Implement WordCount


 

One, the definition of MapReduce

       MapReduce is a framework for distributed computing programs. Its core function is to combine the business logic code written by the user with built-in default components into a complete distributed computing program that runs on a Hadoop cluster.

Two, the advantages and disadvantages of MapReduce

2.1 Advantages

       (1) MapReduce is easy to program: by simply implementing a few interfaces, you can write a distributed program that is spread across a large number of cheap PC machines to run. (2) Good scalability. (3) High fault tolerance. (4) Well suited to offline processing of massive amounts of data.

2.2 Disadvantages

       (1) Not good at real-time computation: MapReduce cannot return results within milliseconds or seconds. (2) Not good at streaming computation: the input of streaming computation is dynamic, while the input data set of MapReduce is static and cannot change while the job runs. (3) Not good at DAG (directed acyclic graph) computation, where multiple applications depend on each other and the input of one application is the output of the previous one. MapReduce can handle this, but the output of every MapReduce job is written to disk, which causes a great deal of disk IO and very poor performance.

Three, the core idea of MapReduce

      (1) A distributed computing program often needs to be divided into at least two stages. (2) The concurrent MapTask instances in the first stage run completely in parallel and are independent of each other. (3) The concurrent ReduceTask instances in the second stage are also independent of each other, but their data depends on the output of all the MapTask instances from the previous stage. (4) The MapReduce programming model can contain only one Map phase and one Reduce phase; if the user's business logic is more complex, multiple MapReduce programs have to be run in series.

Four, the MapReduce process

       When a complete MapReduce program runs in distributed mode, there are three types of instance processes: (1) MrAppMaster: responsible for process scheduling and state coordination of the whole program. (2) MapTask: responsible for the entire data processing flow of the Map phase. (3) ReduceTask: responsible for the entire data processing flow of the Reduce phase.

       Simply put, MapReduce consists of two stages: Map and Reduce. The Map phase processes the input data in parallel, and the Reduce phase aggregates the Map results. Each MapTask writes its results to disk, each ReduceTask pulls its share of the data from every MapTask, and the Shuffle process connects the Map and Reduce stages.

Five, encoding to implement MapReduce's WordCount

5.1 Serialization types

       A look at the source code shows that a WordCount case consists of Map, Reduce, and Driver parts, and that the data types used are the serialization types that Hadoop encapsulates itself. Commonly used data serialization types are listed below:
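
Java type        Hadoop Writable type
boolean          BooleanWritable
byte             ByteWritable
int              IntWritable
float            FloatWritable
long             LongWritable
double           DoubleWritable
String           Text
Map              MapWritable
Array            ArrayWritable
null             NullWritable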

5.2 Programming Specifications

       A MapReduce program is divided into three parts: Mapper, Reducer, and Driver.

       1. Mapper stage

       (1) A user-defined Mapper must extend the Mapper parent class. (2) The input data of the Mapper is in the form of KV pairs (the KV types can be customized). (3) The business logic of the Mapper is written in the map() method. (4) The output data of the Mapper is also in the form of KV pairs (the KV types can be customized). (5) The MapTask process calls the map() method once for each input <K,V> pair.

       2. Reducer stage

       (1) A user-defined Reducer must extend the Reducer parent class. (2) The input data types of the Reducer correspond to the output data types of the Mapper, and are also KV pairs. (3) The business logic of the Reducer is written in the reduce() method. (4) The ReduceTask process calls the reduce() method once for each group of <K,V> pairs that share the same K.

       3. Driver stage

       The Driver is equivalent to a client of the YARN cluster. It submits our entire program to the YARN cluster; what is submitted is a Job object that encapsulates the runtime parameters of the MapReduce program.

5.3 Implement WordCount

       1. Requirement: given a text file, count the number of occurrences of each word in the file.
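
       For example, a hypothetical input file (words separated by tabs, matching the tab split used in the Mapper below) might look like this:

hello	world
hello	hadoop
mapreduce	hello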

       2. Create a new Maven project and add the following dependencies in the pom.xml file:

<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>RELEASE</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <version>2.8.2</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.2</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.2</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.7.2</version>
</dependency>

       3. Create a new file named log4j.properties in the project's src/main/resources directory and fill it with the following content:

log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n

       4. Write the Mapper class

package com.xzw.hadoop.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 *                             _ooOoo_
 *                            o8888888o
 *                            88" . "88
 *                            (| -_- |)
 *                            O\  =  /O
 *                         ____/`---'\____
 *                       .'  \\|     |//  `.
 *                      /  \\|||  :  |||//  \
 *                     /  _||||| -:- |||||-  \
 *                     |   | \\\  -  /// |   |
 *                     | \_|  ''\---/''  |   |
 *                     \  .-\__  `-`  ___/-. /
 *                   ___`. .'  /--.--\  `. . __
 *                ."" '<  `.___\_<|>_/___.'  >'"".
 *               | | :  `- \`.;`\ _ /`;.`/ - ` : | |
 *               \  \ `-.   \_ __\ /__ _/   .-` /  /
 *          ======`-.____`-.___\_____/___.-`____.-'======
 *                             `=---='
 *          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 *                     Buddha bless: no bugs forever
 * @Description
 * @Author xzw
 * @Date Created by 2020/5/19 10:16
 */
public class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Text word = new Text();
    private IntWritable one = new IntWritable(1);

    /**
     * Called by the MapTask once for each input <key, value> pair.
     *
     * @param key     the byte offset of the current line in the input file
     * @param value   the text of the current line
     * @param context used to emit the output <word, 1> pairs
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        //get the current line of input as a String
        String line = value.toString();

        //split the line on tab characters
        String[] words = line.split("\t");

        //iterate over the words and emit each one as (word, 1)
        for (String word: words) {
            this.word.set(word);
            context.write(this.word, this.one);
        }
    }
}
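
       With the default TextInputFormat, the key passed to map() is the byte offset of the line in the file and the value is the line itself. For a line containing "hello" and "hadoop" separated by a tab, this Mapper emits (hello, 1) and (hadoop, 1); the actual counting is left to the Reducer.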

       5. Write the Reducer class

package com.xzw.hadoop.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * @Description
 * @Author xzw
 * @Date Created by 2020/5/19 10:27
 */
public class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable total = new IntWritable();

    /**
     * Called by the ReduceTask once for each group of values that share the same key.
     *
     * @param key     the word
     * @param values  all counts emitted by the Mappers for this word
     * @param context used to emit the output <word, total count> pair
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException,
            InterruptedException {
        //sum up the counts for this word
        int sum = 0;
        for (IntWritable value: values) {
            sum += value.get();
        }

        //wrap the total in an IntWritable and write out the result
        total.set(sum);
        context.write(key, total);

    }
}
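
       For example, if the Reducer receives the key "hello" with the values [1, 1, 1], it sums them and writes (hello, 3) to the output.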

       6. Write the Driver class

package com.xzw.hadoop.mapreduce.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * @Description
 * @Author xzw
 * @Date Created by 2020/5/19 10:36
 */
public class WcDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        //hard-code the input and output paths (for convenient local testing)
        args = new String[]{"e:/input/xzw.txt", "e:/output"};

        //1. get a Job instance
        Job job = Job.getInstance(new Configuration());

        //2. set the jar by class so the cluster can locate the program
        job.setJarByClass(WcDriver.class);

        //3. set the Mapper and Reducer classes
        job.setMapperClass(WcMapper.class);
        job.setReducerClass(WcReducer.class);

        //4. set the output key/value types of the Mapper and of the job
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        //5. set the input and output paths (here, the test paths defined above)
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        //6. submit the job and wait for it to complete
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
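
       Note that main() starts by overwriting args with hard-coded local Windows paths, which is convenient for testing in the IDE. Before packaging the job to run on YARN (described below), that line should be removed or commented out so that the input and output paths passed on the command line take effect:

        //comment out or delete this line before packaging for the cluster,
        //otherwise the paths passed to `hadoop jar` are ignored:
        //args = new String[]{"e:/input/xzw.txt", "e:/output"};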

       7. Run the program and check the results

       The job writes its results to the file part-r-00000 in the output directory.
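
       Each line of part-r-00000 contains a word and its count, separated by a tab. With the hypothetical sample input shown above, the output would look like this:

hadoop	1
hello	3
mapreduce	1
world	1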

       This shows that a simple WordCount has been implemented. So far it has only been tested locally, but it can also be packaged and run on the cluster. The following describes how to package the program and submit it to run on YARN (I clearly remember writing a very detailed and specific blog post about packaging before, but unfortunately I can't find it... crazy~~). There are many ways to package a Maven project built with IDEA: some package directly with Maven, and some use Maven plug-ins. Here I use a native, very simple but very practical packaging method, because packaging with Maven can fail due to network issues, such as plug-ins that fail to download. Okay, enough talk, let's get started.

       (1) File --> Project Structure --> Artifacts --> + --> JAR --> From modules with dependencies...

       (2) Select the main class

       (3) Click Apply-->OK

       (4)Build-->Build Artifacts

       (5) Select Build in the pop-up dialog

       (6) Upload the finished package to the server and run the following command:

hadoop jar wordcount.jar com.xzw.hadoop.mapreduce.wordcount.WcDriver /xzw/xzw.txt /xzw/output
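
       Note that the output directory (/xzw/output here, and e:/output in the local test) must not already exist when the job is submitted; otherwise FileOutputFormat will refuse to run the job because the output path already exists.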

       If the console output shows that the job completed successfully, the run worked.

       (7) View the job results on the cluster

 

       That's it for this article. If you ran into any problems along the way, feel free to leave a message and let me know~


Origin blog.csdn.net/gdkyxy2013/article/details/107606716