Big Data 08: Hadoop MapReduce


1 MapReduce Overview

MapReduce originates from Google's MapReduce paper.

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

MapReduce

  • Pros: offline processing of massive data sets; easy to develop and easy to run
  • Cons: not suitable for real-time or streaming computation

2 MapReduce Programming Model

wordcount: count how many times each word appears in a file

Requirement: compute a word count (wc)
1) Small file: a shell one-liner is enough
2) Very large files (GB or even TB): how do we run this kind of statistical analysis over a huge volume of data?

==> Top-N URL counting <== an extension of wc
Many real-world jobs are built by adapting the wordcount pattern

Solve it with a distributed computing framework: MapReduce.

2.1 MapReduce Execution Flow

[Figure: MapReduce execution flow]

Divide and conquer.

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
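
A hand-worked sketch (not from the original post) of how two sample lines flow through this pipeline for wordcount:

(input)   <0, "hello world">  <12, "hello hadoop">
map       <hello, 1> <world, 1> <hello, 1> <hadoop, 1>
combine   <hello, 2> <world, 1> <hadoop, 1>   (local, per-mapper aggregation)
reduce    <hadoop, 1> <hello, 2> <world, 1>   (output)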

1) Execution steps:

  1. Prepare the input data for the map phase
  2. Mapper processing
  3. Shuffle
  4. Reducer processing
  5. Write out the results

2) Core concepts

Split:

  • The chunk of data handed to a MapReduce job for processing; the smallest unit of computation in MapReduce.
  • One split is processed by one Mapper task.
  • HDFS: the block is the smallest unit of storage in HDFS (128 MB by default).
  • By default, splits and blocks correspond one-to-one (for example, a 300 MB file stored as 128 MB + 128 MB + 44 MB blocks yields three splits and therefore three map tasks); the mapping can be changed manually, but that is not recommended.

InputFormat:

  • Splits the input data into InputSplits: InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
  • TextInputFormat: handles plain-text input

OutputFormat: writes the job output (a driver configuration sketch follows after these items)

Combiner

Partitioner
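
For illustration only (not part of the original wordcount driver): this is how the formats can be chosen explicitly on a Job; TextInputFormat and TextOutputFormat are already the defaults for text data, and the combiner and partitioner setters are covered in sections 4.2 and 4.3.

// in the driver; requires org.apache.hadoop.mapreduce.lib.input.TextInputFormat
// and org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
job.setInputFormatClass(TextInputFormat.class);   // default input format for text files
job.setOutputFormatClass(TextOutputFormat.class); // default output format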

3 MapReduce Architecture

3.1 MapReduce 1.x Architecture

[Figure: MapReduce 1.x architecture]
1) JobTracker (JT)

  • Manages jobs (the coordinator)
  • Breaks a job down into a set of tasks: MapTask and ReduceTask
  • Assigns those tasks to TaskTrackers to run
  • Monitors jobs and handles failures (if a task dies, there is a mechanism to restart it)
  • If the JT receives no heartbeat from a TT within a certain interval, the TT is assumed dead and the tasks running on it are reassigned to other TTs

2) TaskTracker (TT)

  • Executes tasks (the worker)
  • Runs our tasks (MapTask and ReduceTask) on the TT
  • Interacts with the JT: runs/starts/stops tasks and sends heartbeat messages to the JT

3) MapTask

  • Runs the map logic that we wrote
  • Parses each input record and passes it to our map method
  • Writes the map output to local disk (map-only jobs, which have no reduce phase, write directly to HDFS)

4) ReduceTask

  • Reads the data output by the MapTasks
  • Groups the data by key and passes each group to our own reduce method
  • Writes the results to HDFS

3.2 MapReduce 2.x Architecture

[Figure: MapReduce 2.x architecture on YARN]

MapReduce 2.x runs on YARN:

  1. The client submits the MapReduce job, requests resources, and queries the job's status
  2. A node is chosen and a container is started on it to run the MR AppMaster
  3. The AppMaster registers with the RM and requests resources
  4. Resources are granted
  5. Tasks are launched on the corresponding NodeManagers
  6. The tasks run

4 MapReduce Programming

4.1 Developing wordcount with IDEA + Maven

1) Development

pom.xml

<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.lihaogn.hadoop</groupId>
  <artifactId>hadoop-train</artifactId>
  <version>1.0</version>

  <name>hadoop-train</name>
  <!-- FIXME change it to the project's website -->
  <url>http://www.example.com</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
    <!-- added: the Hadoop version to build against -->
    <hadoop.version>2.6.0-cdh5.7.0</hadoop.version>
  </properties>

  <!-- add the Cloudera repository -->
  <repositories>
    <repository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  </repositories>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>

    <!-- add the Hadoop dependency -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>

  </dependencies>

  <build>
    <pluginManagement><!-- lock down plugins versions to avoid using Maven defaults (may be moved to parent pom) -->
      <plugins>
        <plugin>
          <artifactId>maven-clean-plugin</artifactId>
          <version>3.0.0</version>
        </plugin>
        <!-- see http://maven.apache.org/ref/current/maven-core/default-bindings.html#Plugin_bindings_for_jar_packaging -->
        <plugin>
          <artifactId>maven-resources-plugin</artifactId>
          <version>3.0.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>3.7.0</version>
        </plugin>
        <plugin>
          <artifactId>maven-surefire-plugin</artifactId>
          <version>2.20.1</version>
        </plugin>
        <plugin>
          <artifactId>maven-jar-plugin</artifactId>
          <version>3.0.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-install-plugin</artifactId>
          <version>2.5.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-deploy-plugin</artifactId>
          <version>2.8.2</version>
        </plugin>

        <!-- mvn assembly:assembly packages the dependencies into the jar -->
        <plugin>
          <artifactId>maven-assembly-plugin</artifactId>
          <configuration>
            <archive>
              <manifest>
                <mainClass></mainClass>
              </manifest>
            </archive>
            <descriptorRefs>
              <descriptorRef>jar-with-dependencies</descriptorRef>
            </descriptorRefs>
          </configuration>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>
</project>

WordCountApp.java

package com.imooc.hadoop.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * A WordCount application developed with MapReduce
 */
public class WordCountApp {

    /**
     * Map: read the input file
     */
    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable>{

        LongWritable one = new LongWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

            // each incoming line of data
            String line = value.toString();

            // split the line on the specified delimiter (a space)
            String[] words = line.split(" ");

            for(String word :  words) {
                // emit the map result through the context
                context.write(new Text(word), one);
            }
        }
    }

    /**
     * Reduce: aggregate the values for each key
     */
    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {

            long sum = 0;
            for(LongWritable value : values) {
                // sum up the occurrences of this key
                sum += value.get();
            }

            // emit the final count
            context.write(key, new LongWritable(sum));
        }
    }

    /**
     * Driver: encapsulates all the information about the MapReduce job
     */
    public static void main(String[] args) throws Exception{

        // create the Configuration
        Configuration configuration = new Configuration();

        // create the Job
        Job job = Job.getInstance(configuration, "wordcount");

        // set the job's main class
        job.setJarByClass(WordCountApp.class);

        // set the job's input path
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        // map-side settings
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // reduce-side settings
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // set the job's output path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

2) Build:

mvn clean package -DskipTests

3) Upload to the server:

scp target/hadoop-train-1.0.jar hadoop@hadoop000:~/lib

4) Run

Prerequisites:

  • Hadoop is running
  • The input file has been uploaded to HDFS

hadoop jar /home/hadoop/lib/hadoop-train-1.0.jar \
com.imooc.hadoop.mapreduce.WordCountApp \
hdfs://hadoop000:8020/hello.txt \
hdfs://hadoop000:8020/output/wc

5) Result
[Figure: wordcount job output on HDFS]

Note:

1) Running the same code and script a second time fails with an error:

security.UserGroupInformation:
    PriviledgedActionException as:hadoop (auth:SIMPLE) cause:
    org.apache.hadoop.mapred.FileAlreadyExistsException: 
    Output directory hdfs://hadoop000:8020/output/wc already exists
    Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: 
    Output directory hdfs://hadoop000:8020/output/wc already exists

Cause: in MapReduce, the output directory must not already exist.
Solutions:

  1. Manually delete the output directory from the shell first:

    hadoop fs -rm -r /output/wc
  2. Delete it automatically in the code (the recommended approach):

    Path outputPath = new Path(args[1]);
    FileSystem fileSystem = FileSystem.get(configuration);
    if(fileSystem.exists(outputPath)){
        fileSystem.delete(outputPath, true);
        System.out.println("output directory already existed and has been deleted");
    }

4.2 MapReduce Programming: Combiner

1) A local reducer
2) Reduces the amount of data output by the map tasks, and therefore the amount of data transferred over the network

Combiner: performs a local reduce on the mapper side.
[Figure: wordcount data flow with a combiner]

Change: add one line of setup in the main function (see the sketch below).
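
A minimal sketch of that one-line change (assuming the combiner simply reuses the existing MyReducer, which is safe here because word counting is a plain sum):

        // set the combiner class; MyReducer can be reused because summation is associative
        job.setCombinerClass(MyReducer.class);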

hadoop jar /home/hadoop/lib/hadoop-train-1.0.jar \
com.imooc.hadoop.mapreduce.CombinerApp \
hdfs://hadoop000:8020/hello.txt \
hdfs://hadoop000:8020/output/wc

Use cases:

  • Sums and counts work well
  • Averages do not: for example, avg(1, 2) = 1.5 on one mapper and avg(3) = 3 on another combine to avg(1.5, 3) = 2.25, while the true average of 1, 2, 3 is 2

4.3 MapReduce Programming: Partitioner

1) The partitioner decides which reduce task each record output by a map task is sent to
2) Default implementation: the key's hash value modulo the number of reduce tasks (a sketch follows below)
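
For reference, the default behaviour is essentially what Hadoop's HashPartitioner does; a simplified sketch of its getPartition logic:

    // mask off the sign bit of the key's hash, then take it modulo the number of reducers
    public int getPartition(Text key, LongWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }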

Partitioner example:

1. Prepare the data
[Figure: sample input data; each line is "<brand> <count>", with rows for xiaomi, huawei, iphone7, and other brands]
2. Modify the code

package com.imooc.hadoop.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class ParititonerApp {

    /**
     * Map: read the input file
     */
    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable>{

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

            // each incoming line of data
            String line = value.toString();

            // split the line on the specified delimiter (a space)
            String[] words = line.split(" ");

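            // the first field is the brand (output key), the second field is the numeric count (output value)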
            context.write(new Text(words[0]), new LongWritable(Long.parseLong(words[1])));

        }
    }

    /**
     * Reduce: aggregate the values for each key
     */
    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {

            long sum = 0;
            for(LongWritable value : values) {
                // sum up the counts for this key
                sum += value.get();
            }

            // emit the final count
            context.write(key, new LongWritable(sum));
        }
    }

    public static class MyPartitioner extends Partitioner<Text, LongWritable> {

        @Override
        public int getPartition(Text key, LongWritable value, int numPartitions) {

            if(key.toString().equals("xiaomi")) {
                return 0;
            }

            if(key.toString().equals("huawei")) {
                return 1;
            }

            if(key.toString().equals("iphone7")) {
                return 2;
            }

            return 3;
        }
    }


    /**
     * Driver: encapsulates all the information about the MapReduce job
     */
    public static void main(String[] args) throws Exception{

        // create the Configuration
        Configuration configuration = new Configuration();

        // clean up the output directory if it already exists
        Path outputPath = new Path(args[1]);
        FileSystem fileSystem = FileSystem.get(configuration);
        if(fileSystem.exists(outputPath)){
            fileSystem.delete(outputPath, true);
            System.out.println("output directory already existed and has been deleted");
        }

        // create the Job
        Job job = Job.getInstance(configuration, "wordcount");

        // set the job's main class
        job.setJarByClass(ParititonerApp.class);

        // set the job's input path
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        // map-side settings
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // reduce-side settings
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // set the job's partitioner
        job.setPartitionerClass(MyPartitioner.class);
        // use 4 reducers, one per partition
        job.setNumReduceTasks(4);

        // set the job's output path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

3. Run

hadoop jar /home/hadoop/lib/hadoop-train-1.0.jar \
com.imooc.hadoop.mapreduce.ParititonerApp \
hdfs://hadoop000:8020/partitioner \
hdfs://hadoop000:8020/output/partitioner

4. Result
[Figure: partitioner job output, one output file per reduce task]
