Secondary Sort with MapReduce, with an Example

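First, let's clarify a few concepts:
1. Partitioning (partition)
1) Default behavior:
By default, records are partitioned by the hash of the key, but this can easily cause "data skew" (most of the data lands on the same reducer, which hurts performance),
so a custom partitioner is sometimes needed.
2) What partitioning means: it decides which reducer each key/value pair is assigned to.
  The assignment of keys to Reducers is defined by the Partitioner
  (override: getPartition(KEY key, VALUE value, int numPartitions)).
3) How do you write a custom partitioner?
Write a class that extends Partitioner and overrides its getPartition method, then register it on the Job by calling setPartitionerClass.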

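4) The system default partitioner
  The default Partitioner is HashPartitioner. It takes the hash of the key modulo the number of Reducers to pick the Reducer. This guarantees that records with the same key are always assigned to the same reducer.
5) Execution flow
    Map output is distributed to the Reducers through the partitioner. Each map output record is first assigned to a partition and sorted by key within it. If a Combiner is set, it then merges the records inside each partition, and the merged data is sent to the Reducers.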
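
The modulo rule described above can be tried outside Hadoop. The following is a minimal plain-Java sketch, not Hadoop code: the class HashPartitionDemo and its methods are hypothetical names, and String.hashCode() stands in for the hash of a Hadoop key, so the actual partition numbers will differ from a real job.

```java
import java.util.Map;
import java.util.TreeMap;

final class HashPartitionDemo {
    // Same arithmetic as a hash partitioner: clear the sign bit, then take the remainder.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Count how many records land in each partition (partition number -> record count).
    static Map<Integer, Integer> histogram(String[] keys, int numPartitions) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String key : keys) {
            counts.merge(partitionFor(key, numPartitions), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] keys = {"hadoop@apache", "hive@apache", "yarn@apache",
                         "hive@apache", "hadoop@apache", "hive@apache", "hadoop@apache"};
        // Records with the same key always share a partition; if a few keys
        // dominate the input, one partition receives most of the records (skew).
        System.out.println(histogram(keys, 2));
    }
}
```

Clearing the sign bit with & Integer.MAX_VALUE keeps the result non-negative even when hashCode() is negative, which is why the remainder is always a valid partition index.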

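2. Grouping (grouping)
1) Concept:
Grouping defines which keys are placed in the same group, that is, which keys are handed to a single reduce() call.
2) Custom grouping comparator
Implement a WritableComparator and override compare() to define the comparison policy.
The custom grouping class must also be registered:
    job.setGroupingComparatorClass(SencondarySortGroupComparator.class);// custom grouping
3) Sort order within a group (an optimization)
The order of the records inside a group comes from the sort comparator, which is applied during the shuffle, before grouping. By default the framework uses the comparator registered for the key type, which falls back to the key's compareTo(). A custom RawComparator can instead compare the serialized bytes directly and avoid deserializing the keys.
4) How do you customize the sort within a group? As follows:
Extend WritableComparator and override compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2).
It must also be registered:
   job.setSortComparatorClass(SencondarySortComparator.class);// custom sort within a group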
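
The interplay of the sort order and the grouping comparator can be imitated in plain Java. This is only a sketch under stated assumptions: GroupingDemo, Key, SORT and GROUP_BY_ACCOUNT are hypothetical names, and the loop is a simplified model of how sorted keys are cut into groups, not the framework's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class GroupingDemo {
    // Hypothetical stand-in for a composite key: (account, cost).
    static final class Key {
        final String account;
        final int cost;
        Key(String account, int cost) { this.account = account; this.cost = cost; }
    }

    // Sort order: account first, then cost.
    static final Comparator<Key> SORT =
        Comparator.comparing((Key k) -> k.account).thenComparingInt(k -> k.cost);

    // Grouping order: account only, so keys that differ only in cost count as equal.
    static final Comparator<Key> GROUP_BY_ACCOUNT = Comparator.comparing((Key k) -> k.account);

    // Sort the keys, then start a new group whenever the grouping comparator
    // reports that the current key differs from the previous one.
    static List<List<Key>> group(List<Key> keys, Comparator<Key> grouping) {
        List<Key> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        List<List<Key>> groups = new ArrayList<>();
        Key previous = null;
        for (Key k : sorted) {
            if (previous == null || grouping.compare(previous, k) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(k);
            previous = k;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Key> keys = Arrays.asList(
            new Key("hadoop@apache", 200), new Key("hive@apache", 550),
            new Key("yarn@apache", 580), new Key("hive@apache", 159),
            new Key("hadoop@apache", 300), new Key("hive@apache", 258),
            new Key("hadoop@apache", 300));
        System.out.println(group(keys, SORT).size());             // prints 6
        System.out.println(group(keys, GROUP_BY_ACCOUNT).size()); // prints 3
    }
}
```

Each group corresponds to one reduce() call in this model. Grouping by the full key gives one group per distinct (account, cost) pair, while grouping by account alone gives one group per account, with the costs inside each group already in ascending order.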

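Let's write an example to make secondary sort more concrete.
Secondary sort means that records with the same first field are sorted by the second field.
For example, an e-commerce platform records the amount of every order placed by every user. The task is to sort all order amounts that belong to the same user,
and the user names in the output must be sorted as well.
       Account (account)   Order amount (cost)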
hadoop@apache 200
hive@apache 550
yarn@apache 580
hive@apache 159
hadoop@apache 300
hive@apache 258
hadoop@apache 300

The result after secondary sort is:
           Account (account)   Order amount (cost)
           hadoop@apache 200
           hadoop@apache 300
           hadoop@apache 300
           hive@apache 159
           hive@apache 258
           hive@apache 550
           yarn@apache 580
Code:
a. Implement a custom Writable class

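import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Composite key: the account first, then the order amount
public class AccountBean implements WritableComparable<AccountBean> {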
	private Text accout;
	private IntWritable cost;
	public AccountBean() {
		setAccout(new Text());
		setCost(new IntWritable());
	}
	public AccountBean(Text accout, IntWritable cost) {
		this.accout = accout;
		this.cost = cost;
	}
	@Override
	public void write(DataOutput out) throws IOException {
		accout.write(out);
		cost.write(out);
	}
	@Override
	public void readFields(DataInput in) throws IOException {
		accout.readFields(in);
		cost.readFields(in);
	}
	@Override
	public int compareTo(AccountBean o) {
		int tmp = accout.compareTo(o.accout);
		if(tmp ==0){
			return cost.compareTo(o.cost);
		}
		return tmp;
	}
	public Text getAccout() {
		return accout;
	}

	public void setAccout(Text accout) {
		this.accout = accout;
	}

	public IntWritable getCost() {
		return cost;
	}

	public void setCost(IntWritable cost) {
		this.cost = cost;
	} 
    @Override
    public String toString() {
	return accout + "\t" + cost;
    }
}

b. Custom partitioner: partition by account. Based on the key (or value) and the number of reducers, it decides
which reduce task should finally process the current output pair.

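import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class SencondarySortPartition extends Partitioner<AccountBean, NullWritable> {
    @Override
    public int getPartition(AccountBean key, NullWritable value, int numPartitions) {
        // Hash only the account so that all orders of one account reach the same reducer
        return (key.getAccout().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}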

c. Custom grouping comparator: group by account. Keys with the same account fall into one group, and each group is handled by one reduce() call.

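import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class SencondarySortGroupComparator extends WritableComparator {
    public SencondarySortGroupComparator() {
        // true: create key instances so that compare() receives deserialized objects
        super(AccountBean.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        AccountBean acc1 = (AccountBean) a;
        AccountBean acc2 = (AccountBean) b;
        return acc1.getAccout().compareTo(acc2.getAccout()); // same account, same group
    }
}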

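d. Custom RawComparator class: it sorts the records within a group by comparing serialized bytes (which helps performance). It is optional, because AccountBean.compareTo() already defines the same order!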

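import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;

public class SencondarySortComparator extends WritableComparator {
    // Text and IntWritable each provide a comparator that works on serialized bytes
    private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
    private static final IntWritable.Comparator INTWRITABLE_COMPARATOR = new IntWritable.Comparator();

    public SencondarySortComparator() {
        super(AccountBean.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        try {
            // Serialized length of the account: the vint length prefix plus the UTF-8 bytes
            int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
            int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
            // Compare the account bytes first
            int cmp = TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
            if (cmp != 0) {
                return cmp;
            }
            // Same account: compare the 4-byte cost that follows it
            return INTWRITABLE_COMPARATOR.compare(b1, s1 + firstL1, l1 - firstL1, b2, s2 + firstL2, l2 - firstL2);
        } catch (IOException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // static {
    //     WritableComparator.define(AccountBean.class, new SencondarySortComparator());
    // }
}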
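
To show what "comparing serialized bytes" means without a Hadoop runtime, here is a minimal plain-Java sketch. The class RawCompareDemo and its layout are assumptions made for illustration: one length byte, the UTF-8 bytes of the account, then the cost as a 4-byte big-endian int. It only handles accounts shorter than 128 bytes and is not Hadoop's serialization code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class RawCompareDemo {
    // Simplified layout (an assumption of this sketch):
    // [1 length byte][UTF-8 bytes of the account][4-byte big-endian cost]
    static byte[] serialize(String account, int cost) {
        byte[] utf8 = account.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 127) {
            throw new IllegalArgumentException("account too long for this sketch");
        }
        return ByteBuffer.allocate(1 + utf8.length + 4)
            .put((byte) utf8.length)
            .put(utf8)
            .putInt(cost)
            .array();
    }

    // Compare two serialized keys without rebuilding any objects.
    static int compare(byte[] a, byte[] b) {
        int lenA = a[0];
        int lenB = b[0];
        int n = Math.min(lenA, lenB);
        // Account bytes start at index 1; compare them as unsigned values.
        for (int i = 1; i <= n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x < y ? -1 : 1;
            }
        }
        if (lenA != lenB) {
            return lenA < lenB ? -1 : 1;   // a shorter account that is a prefix sorts first
        }
        // Same account: read the two costs straight from the byte arrays.
        int costA = ByteBuffer.wrap(a).getInt(1 + lenA);
        int costB = ByteBuffer.wrap(b).getInt(1 + lenB);
        return Integer.compare(costA, costB);
    }

    public static void main(String[] args) {
        byte[] k1 = serialize("hive@apache", 159);
        byte[] k2 = serialize("hive@apache", 550);
        System.out.println(compare(k1, k2) < 0);   // prints true
    }
}
```

The point of the design is that the shuffle sort compares keys very many times, so skipping object creation on every comparison can save noticeable work. The cost is that the comparator must know the exact byte layout of the key.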

e. Write the Mapper

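import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SencondarySortMapper extends Mapper<LongWritable, Text, AccountBean, NullWritable> {
    private AccountBean acc = new AccountBean();
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer st = new StringTokenizer(value.toString());
        // Skip blank or malformed lines that lack an account and a cost
        if (st.countTokens() < 2) {
            return;
        }
        acc.setAccout(new Text(st.nextToken()));
        acc.setCost(new IntWritable(Integer.parseInt(st.nextToken())));
        context.write(acc, NullWritable.get());
    }
}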

f. Write the Reducer

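import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class SencondarySortReducer extends Reducer<AccountBean, NullWritable, AccountBean, NullWritable> {
    @Override
    protected void reduce(AccountBean key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        // The framework refills key as the iterator advances, so each pass
        // writes the next (account, cost) pair of the group in sorted order
        for (NullWritable nullWritable : values) {
            context.write(key, NullWritable.get());
        }
    }
}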

g. Write the Driver main class

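import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SencondarySortDriver {
	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Path outfile = new Path("file:///D:/outtwo1");
		// Delete the output directory if it already exists, otherwise the job fails
		FileSystem fs = outfile.getFileSystem(conf);
		if (fs.exists(outfile)) {
			fs.delete(outfile, true);
		}
		Job job = Job.getInstance(conf);
		job.setJarByClass(SencondarySortDriver.class);
		job.setJobName("Sencondary Sort");
		job.setMapperClass(SencondarySortMapper.class);
		job.setReducerClass(SencondarySortReducer.class);

		job.setOutputKeyClass(AccountBean.class);
		job.setOutputValueClass(NullWritable.class);
		// Register the custom partitioner and grouping comparator
		job.setPartitionerClass(SencondarySortPartition.class);
		job.setGroupingComparatorClass(SencondarySortGroupComparator.class);
		// job.setSortComparatorClass(SencondarySortComparator.class); // register this class for the custom sort within a group

		FileInputFormat.addInputPath(job, new Path("file:///D:/测试数据/二次排序/"));
		FileOutputFormat.setOutputPath(job, outfile);
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}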

h. Run result

            hadoop@apache          200
            hadoop@apache          300
            hadoop@apache          300
            hive@apache            159
            hive@apache            258
            hive@apache            550
            yarn@apache            580

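Summary:
Understand the concepts of partitioning and grouping.
Partitioning: decides which Reducer each key/value pair goes to.
Grouping: keys that the grouping comparator treats as equal fall into one group, and the reduce task calls reduce() once per group, one group after another.
Without a custom grouping comparator, every distinct composite key forms its own group, so reduce() is called many more times.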

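A small experiment:
Count how many times map() and reduce() are invoked, with and without the grouping comparator.
1) Without grouping
runmap
map invocations=17
runreducer
reducer invocations=10
2) With grouping
runmap
map invocations=17
runreducer
reducer invocations=3
3) As you can see, with grouping the number of reduce() invocations drops noticeably (from 10 to 3), which helps efficiency!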
