Partitioning

1) The default partitioner

import org.apache.hadoop.mapreduce.Partitioner;

public class HashPartitioner<K, V> extends Partitioner<K, V> {
  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value, int numReduceTasks) {
    // Clear the sign bit so the result is non-negative, then mod by the task count
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

The default partitioner computes the partition by taking the key's hashCode() modulo the number of reduce tasks. Under this scheme the user has no control over which key lands in which partition.
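
To see the formula in action, here is a minimal, self-contained sketch; the sample key "13500001111" and the reduce task count of 3 are made-up values for demonstration only:

import org.apache.hadoop.io.Text;

public class HashPartitionDemo {
    public static void main(String[] args) {
        // Hypothetical sample key and reduce task count, for illustration only
        Text key = new Text("13500001111");
        int numReduceTasks = 3;
        // The same formula HashPartitioner applies to every record
        int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        System.out.println("Key " + key + " goes to partition " + partition);
    }
}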

2) Steps to write a custom Partitioner

(1) Define a class that extends Partitioner and override the getPartition() method.

package com.atguigu.mapreduce.flow;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FlowPartitioner extends Partitioner<Text, FlowBean> {
	@Override
	public int getPartition(Text key, FlowBean value, int numPartitions) {
		// Requirement: partition records by the first three digits of the phone number
		// Extract the first three digits of the phone number
		String phoneNum = key.toString().substring(0, 3);
		// Partition 4 is the default, i.e. the catch-all for every other prefix
		int partition = 4;
		if ("135".equals(phoneNum)) {
			partition = 0;
		} else if ("136".equals(phoneNum)) {
			partition = 1;
		} else if ("137".equals(phoneNum)) {
			partition = 2;
		} else if ("138".equals(phoneNum)) {
			partition = 3;
		}
		return partition;
	}
}
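
Note that any phone number whose first three digits match none of 135, 136, 137, or 138 keeps the default value and lands in the catch-all partition 4, so getPartition() can return five distinct values (0 through 4) in total. This is why step (3) below sets five reduce tasks.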

(2) In the job driver, set the custom partitioner:

job.setPartitionerClass(FlowPartitioner.class);

(3) After customizing the partitioner, set the number of reduce tasks to match the partitioner's logic:

job.setNumReduceTasks(5);
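
For context, here is a minimal driver sketch that puts both settings together. FlowMapper, FlowReducer, and the args-based input/output paths are hypothetical stand-ins; only the two partitioner-related lines come from the steps above:

package com.atguigu.mapreduce.flow;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlowDriver {
	public static void main(String[] args) throws Exception {
		Job job = Job.getInstance(new Configuration());
		job.setJarByClass(FlowDriver.class);

		// Hypothetical mapper/reducer classes for this flow-statistics job
		job.setMapperClass(FlowMapper.class);
		job.setReducerClass(FlowReducer.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(FlowBean.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(FlowBean.class);

		// Plug in the custom partitioner and match the reduce task count
		// to its five possible return values (0-4)
		job.setPartitionerClass(FlowPartitioner.class);
		job.setNumReduceTasks(5);

		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}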

3) Notes:

If the number of reduce tasks is greater than the number of distinct values getPartition() returns, the extra reduce tasks simply produce empty output files named part-r-000xx.

If 1 < number of reduce tasks < number of distinct getPartition() values, some of the partitioned data has no reduce task to receive it, and the job fails with an Exception (typically an IOException reporting an illegal partition).

If the number of reduce tasks is 1, then no matter how many partitions the map side produces, all of the data goes to that single reduce task, and only one output file, part-r-00000, is generated.

For example, suppose the custom partitioner defines 5 partitions:

(1) job.setNumReduceTasks(1); runs normally, but produces only a single output file.

(2) job.setNumReduceTasks(2); fails with an error.

(3) job.setNumReduceTasks(6); is greater than 5; the job runs normally but produces an empty output file.

Reposted from blog.csdn.net/qq_40310148/article/details/86652400