Hadoop Big Data Technologies: MapReduce (1) - MapReduce Overview

Chapter 1 MapReduce Overview

1.1 MapReduce defined

MapReduce is a programming framework for distributed computing programs and the core framework with which users develop "Hadoop-based data analysis applications."
The core function of MapReduce is to integrate the user's business logic code with its built-in default components into a complete distributed computing program that runs concurrently on a Hadoop cluster.

1.2 Advantages and disadvantages of MapReduce

1.2.1 Advantages
  1. MapReduce is easy to program
    By simply implementing a few interfaces, you can build a complete distributed program that can be distributed to a large number of low-cost PC machines and run there. In other words, writing a distributed program is just like writing a simple serial program. This feature is what has made MapReduce programming so popular.
  2. Good scalability
    When your computing resources can no longer meet demand, you can extend the computing power simply by adding machines.
1.2.2 Disadvantages
  1. Not good at real-time computing
    MapReduce cannot, like MySQL, return results within milliseconds or a few seconds.
  2. Not good at stream computing
    The input data of stream computing is dynamic, whereas the input data set of MapReduce is static and cannot change dynamically. This is because MapReduce's own design characteristics require the data source to be static.
  3. Not good at DAG (directed acyclic graph) computing
    In a DAG, multiple applications depend on one another, with the output of one application serving as the input of the next. It is not that MapReduce cannot do this, but the output of every MapReduce job is written to disk, which causes a large amount of disk IO and therefore very poor performance.

1.3 MapReduce core idea

The core ideas of MapReduce programming are as follows:
(Figure: the MapReduce core idea)

  1. A distributed computing program often needs to be divided into at least two phases.
  2. The MapTask concurrent instances of the first phase run completely in parallel and are independent of each other.
  3. The ReduceTask concurrent instances of the second phase are also unrelated to one another, but their data all depends on the output of every MapTask concurrent instance of the previous phase.
  4. The MapReduce programming model can contain only one Map phase and one Reduce phase; if the user's business logic is very complex, the only option is to run multiple MapReduce programs serially.
    Summary: analyze the WordCount data flow (sketched below) to gain an in-depth understanding of the MapReduce core idea.
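
As a rough sketch of that data flow, using the sample input from section 1.8 below: each MapTask splits its lines into words and emits a (word, 1) pair per word, the framework groups these pairs by word, and each ReduceTask sums the counts for the words it receives.

map input:     zhangsan  lisi  wanger  maizi
map output:    (zhangsan, 1) (lisi, 1) (wanger, 1) (maizi, 1)   ...and likewise for the other lines
shuffle:       (lisi, [1, 1]) (maizi, [1]) ... (zhangsan, [1, 1, 1])
reduce output: lisi 2, maizi 1, ..., zhangsan 3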

1.4 MapReduce process

(Figure: the overall MapReduce process)

1.5 The official WordCount source code

Decompiling the official source code with a decompiler, you will find that the WordCount example consists of a Map class, a Reduce class, and a driver class, and that the data types it uses are serialization types packaged by Hadoop itself.

1.6 Commonly used serialization types

The table below lists commonly used Java data types and their corresponding Hadoop serialization (Writable) types.

Java type    Hadoop Writable type
boolean      BooleanWritable
byte         ByteWritable
int          IntWritable
float        FloatWritable
long         LongWritable
double       DoubleWritable
String       Text
Map          MapWritable
Array        ArrayWritable
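
As a minimal sketch (not part of the original article's code) of how these wrapper types are used, a plain Java value is wrapped into a Writable for serialization and unwrapped again with get()/toString():

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        // Wrap plain Java values in Hadoop's serializable Writable types
        Text word = new Text("zhangsan");
        IntWritable count = new IntWritable(3);

        // Unwrap them back into plain Java values
        System.out.println(word.toString() + "\t" + count.get());
    }
}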

1.7 MapReduce programming specification

A MapReduce program is written in three parts: the Mapper, the Reducer, and the Driver.
(Figures: Mapper, Reducer, and Driver programming conventions)
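
As a rough skeleton of those conventions (class names here are illustrative; the concrete WordCount versions follow in section 1.8):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: extend Mapper<input key, input value, output key, output value>
// and override map(), which is called once for every input key/value pair.
class SkeletonMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // parse value and call context.write(outputKey, outputValue)
    }
}

// Reducer: extend Reducer<map output key, map output value, output key, output value>
// and override reduce(), which is called once per key with all of that key's values.
class SkeletonReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // aggregate the values and call context.write(key, result)
    }
}

// Driver: a main() method that builds a Job, sets the jar, the Mapper and Reducer,
// the output key/value types and the input/output paths, then submits the job.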

1.8 WordCount hands-on case

1. Requirement

Count the total number of times each word appears in a given text file and output the result.
(1) Input data
Create a text file named date.txt with the following content:

zhangsan  lisi  wanger  maizi
xiangming zhangsan  wanger  lisi
xiaoha mazi zhangsan

(2) Expected output data

lisi	2
maizi	1
mazi	1
wanger	2
xiangming	1
xiaoha	1
zhangsan	3
2. Requirement analysis

Write the Mapper, Reducer, and Driver according to the MapReduce programming specification.

3. Environment preparation

(1) Create a Maven project in IDEA
(2) Add the following dependencies to the pom.xml file

<dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>3.1.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>3.1.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>3.1.2</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-yarn-common</artifactId>
            <version>3.1.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-yarn-client</artifactId>
            <version>3.1.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-yarn-server-resourcemanager</artifactId>
            <version>3.1.2</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>3.1.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>3.1.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-common</artifactId>
            <version>3.1.2</version>
        </dependency>
        <dependency>
            <groupId>net.minidev</groupId>
            <artifactId>json-smart</artifactId>
            <version>2.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>2.12.1</version>
        </dependency>
        <dependency>
            <groupId>org.anarres.lzo</groupId>
            <artifactId>lzo-hadoop</artifactId>
            <version>1.0.6</version>
        </dependency>
    </dependencies>

(3) In the project's src/main/resources directory, create a file named "log4j.properties" and add the following content.

log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
4. Write the program

(1) Write the Mapper class

package com.zhangyong.mapreduce;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

/**
 * @Author zhangyong
 * @Date 2020/3/4 16:35
 * @Version 1.0
 * Mapper class: performs the per-record computation.
 * Type parameter 1: the offset at which the program reads the data
 * Type parameter 2: the content that was read
 * Type parameter 3: the output key type
 * Type parameter 4: the output value type
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    /**
     * key: the byte offset of the line
     * value: the line content that was read
     * context: the job context used to emit output
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        System.out.println(key.get() + " " + value.toString());
        String line = value.toString();
        // Split the line on non-word characters to get the individual words
        String[] split = line.split("\\W+");
        for (String s : split) {
            // Emit (word, 1) for every word in the line
            context.write(new Text(s), new IntWritable(1));
        }
    }
}
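
For example, for the first line of date.txt this map method emits the following key/value pairs, one (word, 1) pair per word:

(zhangsan, 1)  (lisi, 1)  (wanger, 1)  (maizi, 1)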

(2) Write the Reducer class

package com.zhangyong.mapreduce;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * @Author zhangyong
 * @Date 2020/3/4 16:35
 * @Version 1.0
 * Reducer class: aggregates the counts.
 * Type parameter 1: the key type passed from the Map phase
 * Type parameter 2: the value type passed from the Map phase
 * Type parameter 3: the output key type
 * Type parameter 4: the output value type
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        System.out.println(key + " : " + values);
        // Sum up all the 1s emitted by the mappers for this word
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Emit (word, total count)
        context.write(key, new IntWritable(sum));
    }
}
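
For example, for the key zhangsan this reduce method receives the values <1, 1, 1> collected from the map outputs, sums them, and writes:

zhangsan	3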

(3) Write the Driver class

package com.zhangyong.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.IOException;


/**
 * @Author zhangyong
 * @Date 2020/3/4 16:35
 * @Version 1.0
 * Driver class: the entry point of the Hadoop job
 */
public class WordCountDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration cfg = new Configuration();
        // Run in local mode (even if a core-site.xml is on the project classpath, local mode is still used)
        cfg.set("mapreduce.framework.name", "local");
        cfg.set("fs.defaultFS", "file:///");

        Job job = Job.getInstance(cfg);

        job.setJarByClass(WordCountDriver.class);

        // The following two lines are the defaults and can be omitted
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Set the Mapper and Reducer
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Set the Mapper output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Set the Reducer output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // If the output path already exists, delete it
        Path out = new Path("src/resources/output");
        FileSystem fs = FileSystem.get(cfg);
        if (fs.exists(out)) {
            fs.delete(out, true);
        }
        // Set the input directory to analyze and the output directory for the results
        FileInputFormat.addInputPath(job, new Path("src/resources/input"));
        FileOutputFormat.setOutputPath(job, new Path("src/resources/output"));

        boolean b = job.waitForCompletion(true);
    }
}
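
If you later want to submit this same job to a cluster, a common variation (not part of the original code; the argument-based paths are illustrative) is to drop the two local-mode configuration lines, take the input and output paths from the command line, and exit with the job's status, i.e. replace the hard-coded paths and the last line of main with:

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
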
5. Project directory structure

(Figure: project directory structure)

6. Local test

(1) You need to have the Java 1.8 and Hadoop 3.1.2 environments configured locally.
(2) Run the program in IDEA.
(Figure: running the program in IDEA)
When the run completes, the output files are generated.
(Figure: the generated output files)

7. Cluster Test

(Figure: running WordCount on the cluster)
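
As a sketch of a typical cluster submission (the jar name and HDFS paths below are illustrative, and assume the driver has been adapted as noted in section 4 to take its paths from the command line):

hadoop jar wordcount-1.0.jar com.zhangyong.mapreduce.WordCountDriver /input /output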


Origin blog.csdn.net/zy13765287861/article/details/104670538