Hadoop之 MapReduce （Shuffle机制详解）

文章目录

3、Shuffle机制详解

3.1 Shuffle 机制
3.2 Partition
3.3 Partition分区案例练习
3.4 WritableComparable 排序
3.5 WritableComparable 排序案例实操（全排序）
3.6 WritableComparable 排序案例实操（区内排序）
3.7 Combiner 合并
3.8 Combiner 合并案例实操

3、Shuffle机制详解

3.1 Shuffle 机制

Map 方法之后，Reduce 方法之前的数据处理过程称之为Shuffle。

在这里插入图片描述

3.2 Partition

1）需求

要求将统计结果按照条件输出到不同文件中（分区）。比如：将统计结果按照手机归属地不同省份输出到不同文件中（分区）

2）默认 Partitioner 分区

public class HashPartitioner<K, V> extends Partitioner<K, V> {

  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

默认分区是根据 key 的 hashCode 对 ReduceTasks 个数取模得到的。用户没法控制哪个key存储到哪个分区。

3）自定义 Partitioner 步骤

（1）自定义类继承 Partitioner，重写 getPartition() 方法

public class CustomPartitioner extends Partitioner<Text, FlowBean> {
 	@Override
	public int getPartition(Text key, FlowBean value, int numPartitions) {
          // 控制分区代码逻辑
    … …
		return partition;
	}
}

（2）在 Job 驱动中，设置自定义 Partitioner

job.setPartitionerClass(CustomPartitioner.class);

（3）自定义 Partition 后，要根据自定义 Partitioner 的逻辑设置相应数量的 ReduceTask

job.setNumReduceTasks(5);
//默认的是job.setNumReduceTasks(1);

4）分区总结

（1）如果 ReduceTask 的数量大于 getPartition 的结果数，则会多产生几个空的输出文件part-r-000xx；

（2）如果1 < ReduceTask 的数量 < getPartition 的结果数，则有一部分分区数据无处安放，会 Exception；

（3）如果 ReduceTask 的数量 =1，则不管 MapTask 端输出多少个分区文件，最终结果都交给这一个 ReduceTask，最终也就只会产生一个结果文件 part-r-00000；

（4）分区号必须从零开始，逐一累加。

5）案例分析

例如：假设自定义分区数为 5，则

job.setNumReduceTasks(1);	//会正常运行，只不过会产生一个输出文件

job.setNumReduceTasks(2);	//会报错

job.setNumReduceTasks(6);	//大于5，程序会正常运行，会产生空文件

3.3 Partition分区案例练习

1）需求

将统计结果按照手机归属地不同省份输出到不同文件中（分区）

（1）输入数据

phone_data.txt

（2）期望输出数据

手机号136、137、138、139开头都分别放到一个独立的4个文件中，其他开头的放到一个文件中。

2）需求分析

在这里插入图片描述

3）在前面的序列化案例 FlowBean 的基础上，增加一个分区类

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class ProvincePartitioner extends Partitioner<Text, FlowBean> {
	@Override
	public int getPartition(Text key, FlowBean value, int numPartitions) {

		// 1 获取电话号码的前三位
		String preNum = key.toString().substring(0, 3);
		int partition = 4;

		// 2 判断是哪个省
		if ("136".equals(preNum)) {
			partition = 0;
		}else if ("137".equals(preNum)) {
			partition = 1;
		}else if ("138".equals(preNum)) {
			partition = 2;
		}else if ("139".equals(preNum)) {
			partition = 3;
		}
		return partition;
	}
}

4）在驱动函数中增加自定义数据分区设置和ReduceTask设置

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlowsumDriver {
	public static void main(String[] args) throws IllegalArgumentException, IOException, ClassNotFoundException, InterruptedException {
		// 输入输出路径需要根据自己电脑上实际的输入输出路径设置
		args = new String[]{"e:/output1","e:/output2"};
		
        // 1 获取配置信息，或者job对象实例
		Configuration configuration = new Configuration();
		Job job = Job.getInstance(configuration);
		
        // 2 指定本程序的jar包所在的本地路径
		job.setJarByClass(FlowsumDriver.class);
	
   		// 3 指定本业务job要使用的mapper/Reducer业务类
		job.setMapperClass(FlowCountMapper.class);
		job.setReducerClass(FlowCountReducer.class);

		// 4 指定mapper输出数据的kv类型
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(FlowBean.class);

		// 5 指定最终输出的数据的kv类型
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(FlowBean.class);

		// 8 指定自定义数据分区
		job.setPartitionerClass(ProvincePartitioner.class);
		// 9 同时指定相应数量的reduce task
		job.setNumReduceTasks(5);
		
		// 6 指定job的输入原始文件所在目录
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));

		// 7 将job中配置的相关参数，以及job所用的java类所在的jar包， 提交给yarn去运行
		boolean result = job.waitForCompletion(true);
		System.exit(result ? 0 : 1);
	}
}

3.4 WritableComparable 排序

排序概述

在这里插入图片描述

自定义排序 WritableComparable 原理分析

bean 对象做为 key 传输，需要实现 WritableComparable 接口重写 compareTo 方法，就可以实现排序。

@Override
public int compareTo(FlowBean bean) {
	int result;
	// 按照总流量大小，倒序排列
	if (this.sumFlow > bean.getSumFlow()) {
		result = -1;
	}else if (this.sumFlow < bean.getSumFlow()) {
		result = 1;
	}else {
		result = 0;
	}
	return result;
}

3.5 WritableComparable 排序案例实操（全排序）

1）需求

根据序列化案例产生的结果再次对总流量进行倒序排序。

（1）输入数据

原始数据 phone_data.txt

第一次处理后的数据 part-r-00000

（2）期望输出数据

13509468723 7335 110349 117684

13736230513 2481 24681 27162

13956435636 132 1512 1644

13846544121 264 0 264

。。。。。。

2）需求分析

在这里插入图片描述

3）代码实现

（1）FlowBean 对象在在需求1基础上增加了比较功能

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class FlowBean implements WritableComparable<FlowBean> {
	private long upFlow;
	private long downFlow;
	private long sumFlow;
	// 反序列化时，需要反射调用空参构造函数，所以必须有
	public FlowBean() {
		super();
	}
	public FlowBean(long upFlow, long downFlow) {
		super();
		this.upFlow = upFlow;
		this.downFlow = downFlow;
		this.sumFlow = upFlow + downFlow;
	}
	public void set(long upFlow, long downFlow) {
		this.upFlow = upFlow;
		this.downFlow = downFlow;
		this.sumFlow = upFlow + downFlow;
	}
	public long getSumFlow() {
		return sumFlow;
	}
	public void setSumFlow(long sumFlow) {
		this.sumFlow = sumFlow;
	}	
	public long getUpFlow() {
		return upFlow;
	}
	public void setUpFlow(long upFlow) {
		this.upFlow = upFlow;
	}
	public long getDownFlow() {
		return downFlow;
	}
	public void setDownFlow(long downFlow) {
		this.downFlow = downFlow;
	}
	/**
	 * 序列化方法
	 * @param out
	 * @throws IOException
	 */
	@Override
	public void write(DataOutput out) throws IOException {
		out.writeLong(upFlow);
		out.writeLong(downFlow);
		out.writeLong(sumFlow);
	}
	/**
	 * 反序列化方法 注意反序列化的顺序和序列化的顺序完全一致
	 * @param in
	 * @throws IOException
	 */
	@Override
	public void readFields(DataInput in) throws IOException {
		upFlow = in.readLong();
		downFlow = in.readLong();
		sumFlow = in.readLong();
	}
    
    // 重写 ToString()
	@Override
	public String toString() {
		return upFlow + "\t" + downFlow + "\t" + sumFlow;
	}
    
    // 排序
	@Override
	public int compareTo(FlowBean bean) {
		int result;
		// 按照总流量大小，倒序排列
		if (sumFlow > bean.getSumFlow()) {
			result = -1;
		}else if (sumFlow < bean.getSumFlow()) {
			result = 1;
		}else {
			result = 0;
		}
		return result;
	}
}

（2）编写Mapper类

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FlowCountSortMapper extends Mapper<LongWritable, Text, FlowBean, Text>{
	FlowBean bean = new FlowBean();
	Text v = new Text();

    @Override
	protected void map(LongWritable key, Text value, Context context)	throws IOException, InterruptedException {
		// 1 获取一行
		String line = value.toString();

		// 2 截取
		String[] fields = line.split("\t");

		// 3 封装对象
		String phoneNbr = fields[0];
		long upFlow = Long.parseLong(fields[1]);
		long downFlow = Long.parseLong(fields[2]);
		bean.set(upFlow, downFlow);
		v.set(phoneNbr);
		
        // 4 输出
		context.write(bean, v);
	}
}

（3）编写Reducer类

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FlowCountSortReducer extends Reducer<FlowBean, Text, Text, FlowBean>{
    
	@Override
	protected void reduce(FlowBean key, Iterable<Text> values, Context context)	throws IOException, InterruptedException {
		
		// 循环输出，避免总流量相同情况
		for (Text text : values) {
			context.write(text, key);
		}
	}
}

（4）编写Driver类

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlowCountSortDriver {
	public static void main(String[] args) throws ClassNotFoundException, IOException, InterruptedException {
		// 输入输出路径需要根据自己电脑上实际的输入输出路径设置
		args = new String[]{"e:/output1","e:/output2"};

		// 1 获取配置信息，或者job对象实例
		Configuration configuration = new Configuration();
		Job job = Job.getInstance(configuration);

		// 2 指定本程序的jar包所在的本地路径
		job.setJarByClass(FlowCountSortDriver.class);

		// 3 指定本业务job要使用的mapper/Reducer业务类
		job.setMapperClass(FlowCountSortMapper.class);
		job.setReducerClass(FlowCountSortReducer.class);

		// 4 指定mapper输出数据的kv类型
		job.setMapOutputKeyClass(FlowBean.class);
		job.setMapOutputValueClass(Text.class);

		// 5 指定最终输出的数据的kv类型
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(FlowBean.class);

		// 6 指定job的输入原始文件所在目录
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));

		// 7 将job中配置的相关参数，以及job所用的java类所在的jar包， 提交给yarn去运行
		boolean result = job.waitForCompletion(true);
		System.exit(result ? 0 : 1);
	}
}

3.6 WritableComparable 排序案例实操（区内排序）

1）需求

要求每个省份手机号输出的文件中按照总流量内部排序。

2）需求分析

基于前一个需求，增加自定义分区类，分区按照省份手机号设置。

在这里插入图片描述

3）案例实操

（1）增加自定义分区类

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ProvincePartitioner extends Partitioner<FlowBean, Text> {

	@Override
	public int getPartition(FlowBean key, Text value, int numPartitions) {	

		// 1 获取手机号码前三位
		String preNum = value.toString().substring(0, 3);		
		int partition = 4;	

		// 2 根据手机号归属地设置分区
		if ("136".equals(preNum)) {
			partition = 0;
		}else if ("137".equals(preNum)) {
			partition = 1;
		}else if ("138".equals(preNum)) {
			partition = 2;
		}else if ("139".equals(preNum)) {
			partition = 3;
		}
		return partition;
	}
}

（2）在驱动类中添加分区类

// 加载自定义分区类
job.setPartitionerClass(ProvincePartitioner.class);

// 设置Reducetask个数
job.setNumReduceTasks(5);

3.7 Combiner 合并

在这里插入图片描述

（6）自定义Combiner实现步骤

（a）自定义一个 Combiner 继承 Reducer，重写 Reduce 方法

public class WordcountCombiner extends Reducer<Text, IntWritable, Text,IntWritable>{

	@Override
	protected void reduce(Text key, Iterable<IntWritable> values,Context context) throws IOException, InterruptedException {
        // 1 汇总操作
		int count = 0;
		for(IntWritable v :values){
			count += v.get();
		}
        // 2 写出
		context.write(key, new IntWritable(count));
	}
}

（b）在 Job 驱动类中设置：

job.setCombinerClass(WordcountCombiner.class);
//或者直接
job.setCombinerClass(WordcountReducer.class);

3.8 Combiner 合并案例实操

1）需求

统计过程中对每一个 MapTask 的输出进行局部汇总，以减小网络传输量即采用Combiner功能。

（1）数据输入 hello.txt

（2）期望输出数据

期望：Combine 输入数据多，输出时经过合并，输出数据降低。

2）需求分析

在这里插入图片描述

（1）增加一个 WordcountCombiner 类继承 Reducer

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordcountCombiner extends Reducer<Text, IntWritable, Text, IntWritable>{

	IntWritable v = new IntWritable();

    @Override
	protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // 1 汇总
		int sum = 0;
		for(IntWritable value :values){
			sum += value.get();
		}
		v.set(sum);

        // 2 写出
		context.write(key, v);
	}
}

（2）在 WordcountDriver 驱动类中指定 Combiner

// 指定需要使用combiner，以及用哪个类作为combiner的逻辑
job.setCombinerClass(WordcountCombiner.class);

4）案例实操-方案二

（1）将 WordcountReducer 作为 Combiner 在 WordcountDriver 驱动类中指定

// 指定需要使用 Combiner，以及用哪个类作为 Combiner 的逻辑
job.setCombinerClass(WordcountReducer.class);

运行程序，如下图所示

在这里插入图片描述