大数据 08 Hadoop之MapReduce

1 MapReduce概述

源自Google的MapReduce论文。

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

MapReduce

优点：海量数据离线处理&易开发&易运行
缺点：无法进行实时流式计算

2 MapReduce编程模型

wordcount: 统计文件中每个单词出现的次数

需求：求wc
1）文件内容小：用shell
2）文件内容很大： TB GB ???? 如何解决大数据量的统计分析

==> 求url TOPN <== wc的延伸
工作中很多场景的开发都是wc的基础上进行改造的

借助于分布式计算框架来解决：mapreduce

2.1 MapReduce执行流程

这里写图片描述

分而治之

(input) < k1, v1 > -> map -> < k2, v2> -> combine -> < k2, v2 > -> reduce -> < k3, v3 > (output)

1）执行步骤：

扫描二维码关注公众号，回复： 2948502 查看本文章

准备map处理的输入数据
Mapper处理
Shuffle
Reducer处理
结果输出

2）核心概念

Split：

交由MapReduce作业来处理的数据块，是MapReduce中最小的计算单元。
一个 Split 交由一个 Mapper Task 处理。
HDFS：blocksize 是HDFS中最小的存储单元 128M
默认情况下：他们两是一一对应的，当然我们也可以手工设置他们之间的关系（不建议）

InputFormat：

将我们的输入数据进行分片(split): InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
TextInputFormat: 处理文本格式的数据

OutputFormat: 输出

Combiner

Partitioner

3 MapReduce架构

3.1MapReduce1.x的架构

这里写图片描述
1）JobTracker: JT

作业的管理者（管理的）
将作业分解成一堆的任务：Task（MapTask和ReduceTask）
将任务分派给TaskTracker运行
作业的监控、容错处理（task作业挂了，重启task的机制）
在一定的时间间隔内，JT没有收到TT的心跳信息，TT可能是挂了，TT上运行的任务会被指派到其他TT上去执行

2）TaskTracker: TT

任务的执行者（干活的）
在TT上执行我们的Task（MapTask和ReduceTask）
会与JT进行交互：执行/启动/停止作业，发送心跳信息给JT

3）MapTask

自己开发的map任务交由该Task出来
解析每条记录的数据，交给自己的map方法处理
将map的输出结果写到本地磁盘（有些作业只仅有map没有reduce==>HDFS）

4）ReduceTask

将Map Task输出的数据进行读取
按照数据进行分组传给我们自己编写的reduce方法处理
输出结果写到HDFS

3.2 MapReduce2.x的架构

这里写图片描述

运行在yarn上

提交MapReduce作业，申请资源，查询作业执行状况
找一个节点启动一个container来运行MR App Mstr
向RM申请或注册资源
得到资源
到相应的nodemanager上启动任务
启动任务

4 MapReduce编程

4.1 使用IDEA+Maven开发wc：

1）开发

pom.xml

<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.lihaogn.hadoop</groupId>
  <artifactId>hadoop-train</artifactId>
  <version>1.0-</version>

  <name>hadoop-train</name>
  <!-- FIXME change it to the project's website -->
  <url>http://www.example.com</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
    <!-- 我的添加 -->
    <hadoop.version>2.6.0-cdh5.7.0</hadoop.version>
  </properties>

  <!-- 添加仓库 -->
  <repositories>
    <repository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  </repositories>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>

    <!--添加hadoop依赖-->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>

  </dependencies>

  <build>
    <pluginManagement><!-- lock down plugins versions to avoid using Maven defaults (may be moved to parent pom) -->
      <plugins>
        <plugin>
          <artifactId>maven-clean-plugin</artifactId>
          <version>3.0.0</version>
        </plugin>
        <!-- see http://maven.apache.org/ref/current/maven-core/default-bindings.html#Plugin_bindings_for_jar_packaging -->
        <plugin>
          <artifactId>maven-resources-plugin</artifactId>
          <version>3.0.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>3.7.0</version>
        </plugin>
        <plugin>
          <artifactId>maven-surefire-plugin</artifactId>
          <version>2.20.1</version>
        </plugin>
        <plugin>
          <artifactId>maven-jar-plugin</artifactId>
          <version>3.0.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-install-plugin</artifactId>
          <version>2.5.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-deploy-plugin</artifactId>
          <version>2.8.2</version>
        </plugin>

        <!--mvn assembly:assembly 将依赖包打进jar-->
        <plugin>
          <artifactId>maven-assembly-plugin</artifactId>
          <configuration>
            <archive>
              <manifest>
                <mainClass></mainClass>
              </manifest>
            </archive>
            <descriptorRefs>
              <descriptorRef>jar-with-dependencies</descriptorRef>
            </descriptorRefs>
          </configuration>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>
</project>

WordCountApp.java

package com.imooc.hadoop.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * 使用MapReduce开发WordCount应用程序
 */
public class WordCountApp {

    /**
     * Map：读取输入的文件
     */
    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable>{

        LongWritable one = new LongWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

            // 接收到的每一行数据
            String line = value.toString();

            //按照指定分隔符进行拆分
            String[] words = line.split(" ");

            for(String word :  words) {
                // 通过上下文把map的处理结果输出
                context.write(new Text(word), one);
            }
        }
    }

    /**
     * Reduce：归并操作
     */
    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {

            long sum = 0;
            for(LongWritable value : values) {
                // 求key出现的次数总和
                sum += value.get();
            }

            // 最终统计结果的输出
            context.write(key, new LongWritable(sum));
        }
    }

    /**
     * 定义Driver：封装了MapReduce作业的所有信息
     */
    public static void main(String[] args) throws Exception{

        //创建Configuration
        Configuration configuration = new Configuration();

        //创建Job
        Job job = Job.getInstance(configuration, "wordcount");

        //设置job的处理类
        job.setJarByClass(WordCountApp.class);

        //设置作业处理的输入路径
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        //设置map相关参数
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        //设置reduce相关参数
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        //设置作业处理的输出路径
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

2）编译：

mvn clean package -DskipTests

3）上传到服务器：

scp target/hadoop-train-1.0.jar hadoop@hadoop000:~/lib

4）运行

前提条件：

启动hadoop
上传文件到HDFS中

hadoop jar /home/hadoop/lib/hadoop-train-1.0.jar \
com.imooc.hadoop.mapreduce.WordCountApp \
hdfs://hadoop000:8020/hello.txt \
hdfs://hadoop000:8020/output/wc

5）结果
这里写图片描述
注意：

1）相同的代码和脚本再次执行，会报错

security.UserGroupInformation:
    PriviledgedActionException as:hadoop (auth:SIMPLE) cause:
    org.apache.hadoop.mapred.FileAlreadyExistsException: 
    Output directory hdfs://hadoop000:8020/output/wc already exists
    Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: 
    Output directory hdfs://hadoop000:8020/output/wc already exists

原因：在MR中，输出文件是不能事先存在的
解决方法：

先手工通过shell的方式将输出文件夹先删除
```
hadoop fs -rm -r /output/wc
```

在代码中完成自动删除功能: 推荐使用这种方式

Path outputPath = new Path(args[1]);
FileSystem fileSystem = FileSystem.get(configuration);
if(fileSystem.exists(outputPath)){
    fileSystem.delete(outputPath, true);
    System.out.println("output file exists, but is has deleted");
}

4.2 MapReduce编程之combiner

1）本地的reducer
2）减少map tasks输出的数据量及数据网络传输量

Combiner：在Mapper上进行一个本地的reduce操作。
这里写图片描述

修改：在main函数中添加一行设置
这里写图片描述

hadoop jar /home/hadoop/lib/hadoop-train-1.0.jar \
com.imooc.hadoop.mapreduce.CombinerApp \
hdfs://hadoop000:8020/hello.txt \
hdfs://hadoop000:8020/output/wc

使用场景：

求和、次数
求平均数是不行的

4.3 MapReduce编程之partitioner

1）partitioner决定maptask输出的数据交由哪个reduceTasks处理
2）默认实现：分发的key的hash值对reduce task个数取模

Partitioner

1.准备数据
这里写图片描述
2.修改代码

package com.imooc.hadoop.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class ParititonerApp {

    /**
     * Map：读取输入的文件
     */
    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable>{

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

            // 接收到的每一行数据
            String line = value.toString();

            //按照指定分隔符进行拆分
            String[] words = line.split(" ");

            context.write(new Text(words[0]), new LongWritable(Long.parseLong(words[1])));

        }
    }

    /**
     * Reduce：归并操作
     */
    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {

            long sum = 0;
            for(LongWritable value : values) {
                // 求key出现的次数总和
                sum += value.get();
            }

            // 最终统计结果的输出
            context.write(key, new LongWritable(sum));
        }
    }

    public static class MyPartitioner extends Partitioner<Text, LongWritable> {

        @Override
        public int getPartition(Text key, LongWritable value, int numPartitions) {

            if(key.toString().equals("xiaomi")) {
                return 0;
            }

            if(key.toString().equals("huawei")) {
                return 1;
            }

            if(key.toString().equals("iphone7")) {
                return 2;
            }

            return 3;
        }
    }


    /**
     * 定义Driver：封装了MapReduce作业的所有信息
     */
    public static void main(String[] args) throws Exception{

        //创建Configuration
        Configuration configuration = new Configuration();

        // 准备清理已存在的输出目录
        Path outputPath = new Path(args[1]);
        FileSystem fileSystem = FileSystem.get(configuration);
        if(fileSystem.exists(outputPath)){
            fileSystem.delete(outputPath, true);
            System.out.println("output file exists, but is has deleted");
        }

        //创建Job
        Job job = Job.getInstance(configuration, "wordcount");

        //设置job的处理类
        job.setJarByClass(ParititonerApp.class);

        //设置作业处理的输入路径
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        //设置map相关参数
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        //设置reduce相关参数
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        //设置job的partition
        job.setPartitionerClass(MyPartitioner.class);
        //设置4个reducer，每个分区一个
        job.setNumReduceTasks(4);

        //设置作业处理的输出路径
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

3.运行

hadoop jar /home/hadoop/lib/hadoop-train-1.0.jar \
com.imooc.hadoop.mapreduce.ParititonerApp \
hdfs://hadoop000:8020/partitioner \
hdfs://hadoop000:8020/output/partitioner

4.结果
这里写图片描述