Hadoop: Using a Partitioner to Split Output into Separate Files

Requirements:

1. Count the occurrences of each word in the file 1.txt (wordcount):

$ cat 1.txt

aa
bb
aa
dd
ff
rr
ee
aa
kk
jj
hh
uu
ii
tt
rr
tt
oo
uu

2. Limit the output to exactly two files: one holding the words in the range aa~kk, the other holding the words in the range ll~zz.

Solution:

By default, MapReduce assigns map output to reduce partitions with HashPartitioner:

public class HashPartitioner<K, V> extends Partitioner<K, V> { 

  /** Use {@link Object#hashCode()} to partition. */ 
  public int getPartition(K key, V value, 
                          int numReduceTasks) { 
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; 
  } 

} 
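To see why the default is unsuitable here, the sketch below mirrors HashPartitioner's computation on plain Strings (the helper name HashPartitionDemo is mine, not Hadoop's). With 2 reducers, every doubled-letter word in the sample input has an even hashCode (31*c + c = 32*c), so all of them would land in partition 0 and the alphabetical split would never happen:

```java
// Mirrors HashPartitioner.getPartition for String keys, without Hadoop.
public class HashPartitionDemo {
    static int getPartition(String key, int numReduceTasks) {
        // Same formula as Hadoop's HashPartitioner: mask off the sign bit,
        // then take the remainder modulo the number of reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] words = {"aa", "bb", "dd", "ff", "rr", "ee", "kk",
                          "jj", "hh", "uu", "ii", "tt", "oo"};
        for (String w : words) {
            // Every word here is a doubled letter c, so hashCode = 32*c is
            // even and the partition is always 0 when numReduceTasks = 2.
            System.out.println(w + " -> partition " + getPartition(w, 2));
        }
    }
}
```

So hash-based partitioning distributes keys by hash value, not by alphabetical range, which is why a custom Partitioner is needed.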

Override this method in a custom Partitioner:

private static class MyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (key.toString().compareTo("aa") >= 0 && key.toString().compareTo("kk") <= 0) {
            return 0;
        } else {
            return 1;
        }
    }
}
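The comparison logic can be checked without running a job at all. This plain-Java sketch (the class name RangePartitionDemo is mine) applies the same String comparison as MyPartitioner to the sample words, confirming that aa~kk go to partition 0 and everything above kk goes to partition 1:

```java
// Plain-Java check of the range-based partition logic, no Hadoop required.
public class RangePartitionDemo {
    static int getPartition(String key) {
        // Same comparison as MyPartitioner.getPartition: keys between
        // "aa" and "kk" (inclusive) go to reducer 0, the rest to reducer 1.
        return (key.compareTo("aa") >= 0 && key.compareTo("kk") <= 0) ? 0 : 1;
    }

    public static void main(String[] args) {
        String[] words = {"aa", "bb", "dd", "ff", "rr", "ee", "kk",
                          "jj", "hh", "uu", "ii", "tt", "oo"};
        for (String w : words) {
            System.out.println(w + " -> partition " + getPartition(w));
        }
    }
}
```

Note that the comparison is on whole strings, so a hypothetical word like "kx" would sort after "kk" and end up in partition 1; for this dataset that case does not arise.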

Set the conf and job parameters:

conf.set("mapred.reduce.tasks", "2"); // on the newer API: job.setNumReduceTasks(2)
job.setPartitionerClass(MyPartitioner.class);

Output:

$ hadoop fs -cat /lxw/output/part-r-00000
aa      3
bb      1
dd      1
ee      1
ff      1
hh      1
ii      1
jj      1
kk      1

$ hadoop fs -cat /lxw/output/part-r-00001
oo      1
rr      2
tt      2
uu      2

Reposted from superlxw1234.iteye.com/blog/1495465