Hadoop study notes - MapReduce

1. Overview of MapReduce

1.1. Definition of MapReduce

  MapReduce is a programming framework for distributed computing programs and the core framework for developing "Hadoop-based data analysis applications".
  The core function of MapReduce is to combine the business logic code written by the user with its own built-in default components into a complete distributed computing program that runs concurrently on a Hadoop cluster.

1.2. Advantages and disadvantages of MapReduce

1.2.1 Advantages

  1. MapReduce is easy to program.
      By simply implementing a few interfaces, you get a complete distributed program that can run on a large number of cheap PC machines. In other words, writing a distributed program feels just like writing a simple serial program. This feature is what has made MapReduce programming so popular.
  2. Good scalability
      When your computing resources are no longer sufficient, you can expand computing power simply by adding machines.
  3. High fault tolerance
      MapReduce was designed from the start to run on cheap PC machines, which requires high fault tolerance. For example, if one machine goes down, its computing tasks can be transferred to another node so that the job does not fail, and this process requires no manual intervention; it is handled entirely by Hadoop.
  4. Suitable for offline processing of massive data at the PB level and above
      It can make thousands of servers in a cluster work concurrently and provide data processing capability.

1.2.2 Disadvantages

  1. Not good at real-time computing
      MapReduce cannot return results within milliseconds or seconds the way MySQL can.
  2. Not good at streaming computing
      The input data of streaming computing arrives dynamically, while the input data set of MapReduce is static and cannot change dynamically. This is determined by the design of MapReduce itself: the data source must be static.
  3. Not good at DAG (Directed Acyclic Graph) computing
      When multiple applications have dependencies and the input of one application is the output of the previous one, MapReduce can still do the job, but the output of every MapReduce job is written to disk, which causes a large amount of disk I/O and very poor performance.

1.3. The core idea of MapReduce

Main ideas

  1. Distributed computing programs often need to be divided into at least two stages.
  2. The concurrent instances of MapTask in the first stage run completely in parallel and are independent of each other.
  3. The concurrent instances of ReduceTask in the second stage are independent of each other, but their data depends on the output of all concurrent instances of MapTask in the previous stage.
  4. The MapReduce programming model can only contain one Map stage and one Reduce stage. If the user's business logic is very complex, the only option is to run multiple MapReduce programs serially.

  Summary: analyze the WordCount data flow to deeply understand the core idea of MapReduce.
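
  As a rough sketch of that data flow (using a fragment of the sample data from the WordCount case in section 1.8):

  Input line:             "sherry sherry banzhang cls"
  Map output:             (sherry,1) (sherry,1) (banzhang,1) (cls,1)
  Shuffle (group by key): sherry -> [1,1]   banzhang -> [1]   cls -> [1]
  Reduce output:          (sherry,2) (banzhang,1) (cls,1)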

1.4. MapReduce processes

  A complete MapReduce program has three types of instance processes during distributed runtime:

  1. MRAppMaster: responsible for process scheduling and state coordination of the entire program.
  2. MapTask: Responsible for the entire data processing process in the Map phase.
  3. ReduceTask: Responsible for the entire data processing process of the Reduce phase.

1.5. Official WordCount source code

  Decompiling the official WordCount source code with a decompilation tool shows that the WordCount case consists of a Map class, a Reduce class and a driver class, and that the data types are the serialization types encapsulated by Hadoop itself.

1.6. Commonly used data serialization types

Java type    Hadoop Writable type
Boolean      BooleanWritable
Byte         ByteWritable
Int          IntWritable
Float        FloatWritable
Long         LongWritable
Double       DoubleWritable
String       Text
Map          MapWritable
Array        ArrayWritable
Null         NullWritable
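
A minimal sketch of how these wrapper types are created and read (the class name WritableTypesDemo is made up for illustration; it only needs the hadoop-client dependency added in section 1.8 on the classpath):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WritableTypesDemo {
    public static void main(String[] args) {
        // Wrap plain Java values in Hadoop's serializable wrapper types
        Text word = new Text("hadoop");
        IntWritable count = new IntWritable(1);
        LongWritable offset = new LongWritable(0L);

        // Read the wrapped values back out with toString() / get()
        System.out.println(word.toString() + " -> " + count.get() + " (offset " + offset.get() + ")");

        // Writable objects are mutable, so a single instance can be reused;
        // this is why the WordCount Mapper and Reducer below reuse their k and v fields
        count.set(count.get() + 1);
        System.out.println(word + " -> " + count.get());
    }
}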

1.7. MapReduce program specification

The program written by the user is divided into three parts: Mapper, Reducer and Driver.

  1. Mapper stage
    1. A user-defined Mapper must inherit its own parent class
    2. The input data of the Mapper is in the form of KV pairs (the types of K and V can be customized)
    3. The business logic of the Mapper is written in the map() method
    4. The output data of the Mapper is in the form of KV pairs (the types of K and V can be customized)
    5. The map() method (in the MapTask process) is called once for each <K, V>
  2. Reducer stage
    1. A user-defined Reducer must inherit its own parent class
    2. The input data type of the Reducer corresponds to the output data type of the Mapper, which is also KV
    3. The business logic of the Reducer is written in the reduce() method
    4. The ReduceTask process calls the reduce() method once for each group of <K, V> pairs with the same K.
  3. Driver stage
      Equivalent to the client of the YARN cluster, it is used to submit our entire program to the YARN cluster. What is submitted is a Job object that encapsulates the running parameters of the MapReduce program.

1.8. WordCount case practice

1.8.1 Local testing

  1. Requirement
      Count the total number of occurrences of each word in a given text file.
      Prepare a data file and upload it to HDFS (a sketch of the upload commands is shown below the data).
      Data file:
sherry sherry banzhang banzhang cls cls wly wly hadoop xue sss
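
A hedged sketch of the upload, assuming the data file is saved locally as hello.txt and using the HDFS input directory referenced later in this case (/wordcount/input):

hadoop fs -mkdir -p /wordcount/input
hadoop fs -put hello.txt /wordcount/input
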
  2. Requirements analysis
      Following the MapReduce programming specification, write the Mapper, Reducer, and Driver classes separately.
  3. Environment preparation
    1. Create a Maven project => MapReduce
    2. Add the following dependencies to the pom.xml file
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.1.3</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.30</version>
    </dependency>
</dependencies>

   3. In the src/main/resources directory of the project, create a new file named "log4j.properties" and fill it with the following:

log4j.rootLogger=INFO, stdout  
log4j.appender.stdout=org.apache.log4j.ConsoleAppender  
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout  
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n  
log4j.appender.logfile=org.apache.log4j.FileAppender  
log4j.appender.logfile.File=target/spring.log  
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout  
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n

   4. Create the package com.sherry.MapReduce.wordcount

  4. Programming
  • Write the Mapper class
package com.sherry.MapReduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    Text k = new Text();
    IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // 1 Get one line of input
        String line = value.toString();

        // 2 Split the line into words
        String[] words = line.split(" ");

        // 3 Output each word with a count of 1
        for (String word : words) {
            k.set(word);
            context.write(k, v);
        }
    }
}
  • Write the Reducer class
package com.sherry.MapReduce.wordcount;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    int sum;
    IntWritable v = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        // 1 Sum all the counts for this key
        sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }

        // 2 Output the total count for the word
        v.set(sum);
        context.write(key, v);
    }
}
  • Write the Driver class
package com.sherry.MapReduce.wordcount;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // 1 Get the configuration information and the job object
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2 Associate the jar of this Driver program
        job.setJarByClass(WordCountDriver.class);

        // 3 Associate the Mapper and Reducer classes
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // 4 Set the KV types of the Mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 5 Set the KV types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 6 Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 7 Submit the job and wait for it to finish
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
  5. Local test
  • Configure the HADOOP_HOME environment variable and the Windows runtime dependencies (such as winutils.exe) first
  • Run the program in IDEA/Eclipse
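
    As a small usage example (hypothetical local paths; adjust them to your machine), the program arguments in the IDE run configuration could be set to:

D:\wordcount\input\hello.txt D:\wordcount\output\wc

    The Driver reads the first argument as the input path and the second as the output path; the output directory must not exist before the run, otherwise the job will fail.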

1.8.2 Submitting to the cluster for testing

Test on the cluster

  • Package the program into a jar package.
    If the project already has a target folder, clean it first, then package the project.
    Rename the jar package without dependencies to wc.jar and copy it to /opt/module/hadoop-3.1.3/myjarpath on the Hadoop cluster.
  • Execute the wordcount program (a sketch of checking the result follows the command below).
    Remember to start HDFS and YARN before executing the wordcount program.
    Note: adjust the file paths below to wherever your own files were uploaded.
hadoop jar wc.jar com.sherry.MapReduce.wordcount.WordCountDriver /wordcount/input/hello.txt  /wordcount/output/wc
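
After the job finishes, the result can be checked on HDFS. A sketch, assuming the output path used above and the default single reduce task (which writes one part-r-00000 file):

hadoop fs -cat /wordcount/output/wc/part-r-00000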

