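Ordered TopN with Hadoop MapReduce

This post uses the key sorting that Hadoop performs between the map and reduce phases to sort records and keep the top N records of each group.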

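1. Sample data. Assume the records are order lines. The goal is output sorted by order ID, keeping the three highest prices in each order, from highest to lowest.

Order ID	Product ID	Unit price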
0000001	Pdt_01	222.8
0000002	Pdt_05	722.4
0000001	Pdt_02	33.8
0000003	Pdt_06	232.8
0000003	Pdt_02	33.8
0000002	Pdt_03	522.8
0000002	Pdt_04	122.4
0000001	Pdt_01	122.8
0000002	Pdt_05	522.4
0000003	Pdt_02	133.8
0000002	Pdt_03	222.8
0000002	Pdt_04	222.4
0000001	Pdt_01	322.8
0000002	Pdt_05	322.4

2. Approach

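1. Wrap each order record in a bean and use the bean as the map output key, so that Hadoop's automatic key sorting applies to it.
2. Have the bean implement the WritableComparable interface, comparing by order ID ascending and then by price descending. The map output is then sorted by this custom rule during the shuffle.
3. With nothing else done, the reduce phase receives the data sorted by order ID ascending and price descending. However, records with the same order ID but different prices are different keys, so each one gets its own reduce() call and TopN per order cannot be computed this way. (This refers to TopN based on key sorting; there are many other ways to compute TopN.)
4. Use a grouping comparator on the reduce side, written as a subclass of the WritableComparator class. Before reduce() is called, it decides which keys belong to the same group, that is, which records are handled by the same reduce() call.
5. In the WritableComparator subclass, compare the beans by order ID only and ignore the price, so that records with the same ID fall into the same group. The price is already sorted from high to low by the bean's compareTo during the shuffle, so the reduce phase does not need to sort by price.
6. One difficulty remains. After grouping, the key passed to reduce() is the record with the highest price for that order ID. The values emitted by the map phase are all NullWritable, so the reduce phase only gets an iterator of NullWritable values. Reading the key once gives only the maximum, not the top N.
7. The final step relies on how the iterator works. Although every value is NullWritable, each value is paired with its own key, and the framework reuses a single key object: each time the iterator advances, the next record's key is deserialized into the same object that reduce() received as key. Reading key inside the loop therefore yields each record of the group in sorted order, which makes TopN straightforward.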
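The steps above can be sketched without Hadoop. The following is a minimal plain-Java simulation, not the job itself: the class TopNSketch, the nested Order class and the method topNPerOrder are hypothetical names used only for illustration. It sorts records by order ID ascending and price descending, then keeps the first n records of each order ID, which is what the shuffle sort plus the grouped reduce achieve together.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class TopNSketch {
    /** Hypothetical stand-in for the composite key: order ID plus price. */
    static final class Order {
        final int orderId;
        final double price;

        Order(int orderId, double price) {
            this.orderId = orderId;
            this.price = price;
        }

        @Override
        public String toString() {
            return orderId + ":" + price;
        }
    }

    /** Sort order of the shuffle: order ID ascending, then price descending. */
    static final Comparator<Order> SORT = (a, b) -> {
        int byId = Integer.compare(a.orderId, b.orderId);
        return byId != 0 ? byId : Double.compare(b.price, a.price);
    };

    /** Sorts the records, then keeps the first n records of each order ID. */
    static List<Order> topNPerOrder(List<Order> input, int n) {
        List<Order> sorted = new ArrayList<>(input);
        sorted.sort(SORT);
        List<Order> result = new ArrayList<>();
        Order previous = null;
        int taken = 0;
        for (Order o : sorted) {
            if (previous == null || previous.orderId != o.orderId) {
                taken = 0; // a new order ID starts a new group
            }
            if (taken < n) {
                result.add(o);
                taken++;
            }
            previous = o;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Order> sample = Arrays.asList(
                new Order(1, 222.8), new Order(2, 722.4), new Order(1, 33.8),
                new Order(1, 122.8), new Order(2, 522.8), new Order(1, 322.8));
        // prints [1:322.8, 1:222.8, 1:122.8, 2:722.4, 2:522.8]
        System.out.println(topNPerOrder(sample, 3));
    }
}
```

In the real job the sort is done by the framework during the shuffle and the group boundary is detected by the grouping comparator, so the reducer only needs the counting part of this loop.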

3. Code

3.1 OrderBean

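import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class OrderBean implements WritableComparable<OrderBean> {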

    private int order_id;
    private double price;

    public OrderBean() {
    }

    public OrderBean(int order_id, double price) {
        this.order_id = order_id;
        this.price = price;
    }

    @Override
    public int compareTo(OrderBean o) {
        // Order ID ascending, then price descending
        int byId = Integer.compare(this.order_id, o.order_id);
        if (byId != 0) {
            return byId;
        }
        return Double.compare(o.price, this.price);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(order_id);
        out.writeDouble(price);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.order_id = in.readInt();
        this.price = in.readDouble();
    }

    public int getOrder_id() {
        return order_id;
    }

    public void setOrder_id(int order_id) {
        this.order_id = order_id;
    }

    public double getPrice() {
        return price;
    }

    public void setPrice(double price) {
        this.price = price;
    }

    @Override
    public String toString() {
        return "order_id=" + order_id +
                ", price=" + price;
    }

    // The default HashPartitioner picks the reduce task from hashCode().
    // Hashing on order_id alone keeps all records of one order in the same
    // partition when the job runs with more than one reduce task.
    @Override
    public int hashCode() {
        return Integer.hashCode(order_id);
    }
}
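
As a quick check of the serialization contract, the sketch below writes and reads the two fields in the same order as OrderBean.write() and OrderBean.readFields(). It uses only java.io streams so that it runs without Hadoop; the class RoundTripSketch and its methods are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class RoundTripSketch {
    /** Serializes the fields in the same order as OrderBean.write(). */
    static byte[] write(int orderId, double price) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(orderId);
            out.writeDouble(price);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the fields back in the same order as OrderBean.readFields(). */
    static String read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int orderId = in.readInt();
            double price = in.readDouble();
            return "order_id=" + orderId + ", price=" + price;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = write(1, 322.8);
        // prints 12 bytes -> order_id=1, price=322.8
        System.out.println(data.length + " bytes -> " + read(data));
    }
}
```

The read order must match the write order exactly; swapping the two reads would decode the bytes into unrelated values.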

3.2 mapper

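import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class OrderMapper extends Mapper<LongWritable, Text, OrderBean, NullWritable> {
    // Reused for every record; context.write() serializes it immediately
    private final OrderBean k = new OrderBean();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1 Read one line
        String line = value.toString();

        // 2 Split on whitespace: order ID, product ID, unit price
        String[] fields = line.split("\\s+");

        // 3 Fill the key object
        k.setOrder_id(Integer.parseInt(fields[0]));
        k.setPrice(Double.parseDouble(fields[2]));

        // 4 Emit the key; the value is NullWritable
        context.write(k, NullWritable.get());
    }
}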

3.3 Grouping rule for the reducer

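import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * @Author: xu.dm
 * @Date: 2019/8/30 16:15
 * @Version: 1.0
 * @Description: Groups the reducer input. After the map output reaches the reducer and
 * before reduce() is called, the sorted keys are regrouped, which changes which keys
 * count as the same key for one reduce() call.
 *
 * In this example the map output key is OrderBean, sorted by OrderBean.compareTo
 * (ID ascending, price descending). Without grouping, records with the same ID but
 * different prices are treated as different keys. With this custom grouping (a subclass
 * of WritableComparator), only the ID is compared and the price is ignored, so records
 * such as {1, 300} and {1, 200} are treated as the same key, key=1. If reduce() writes
 * the key once without iterating over the values, only the highest price is output.
 **/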
public class OrderSortGroupingComparator extends WritableComparator {

    protected OrderSortGroupingComparator() {
        // true: let the comparator create OrderBean instances for deserializing keys
        super(OrderBean.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Compare by order ID only, so that all prices of one order form one group
        OrderBean aBean = (OrderBean) a;
        OrderBean bBean = (OrderBean) b;
        return Integer.compare(aBean.getOrder_id(), bBean.getOrder_id());
    }
}
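
To make the effect of the grouping rule concrete, here is a small plain-Java sketch that counts how many groups a sorted run of keys splits into under two different comparators. It does not use the Hadoop API; GroupingSketch, Key and countGroups are hypothetical names, and the two comparators only mirror the logic of OrderBean.compareTo and OrderSortGroupingComparator.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class GroupingSketch {
    /** Hypothetical stand-in for OrderBean. */
    static final class Key {
        final int orderId;
        final double price;

        Key(int orderId, double price) {
            this.orderId = orderId;
            this.price = price;
        }
    }

    /** Full comparison, as in OrderBean.compareTo: ID ascending, price descending. */
    static final Comparator<Key> SORT = (a, b) -> {
        int byId = Integer.compare(a.orderId, b.orderId);
        return byId != 0 ? byId : Double.compare(b.price, a.price);
    };

    /** ID-only comparison, as in OrderSortGroupingComparator. */
    static final Comparator<Key> GROUP_BY_ID =
            (a, b) -> Integer.compare(a.orderId, b.orderId);

    /** Counts runs of adjacent keys that the given comparator treats as equal. */
    static int countGroups(List<Key> sortedKeys, Comparator<Key> grouping) {
        int groups = 0;
        Key previous = null;
        for (Key k : sortedKeys) {
            if (previous == null || grouping.compare(previous, k) != 0) {
                groups++;
            }
            previous = k;
        }
        return groups;
    }

    public static void main(String[] args) {
        // Keys in the order the shuffle would deliver them
        List<Key> keys = Arrays.asList(
                new Key(1, 322.8), new Key(1, 222.8),
                new Key(2, 722.4), new Key(2, 522.8),
                new Key(3, 232.8));
        System.out.println(countGroups(keys, SORT));        // prints 5
        System.out.println(countGroups(keys, GROUP_BY_ID)); // prints 3
    }
}
```

With the full comparison every distinct (ID, price) pair is its own group, so reduce() would run once per record; with the ID-only comparison it runs once per order.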

3.4 reducer

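import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * @Author: xu.dm
 * @Date: 2019/8/30 16:21
 * @Version: 1.0
 * @Description: If reduce() reads key without iterating over Iterable<NullWritable> values,
 * key holds only the first record of the group in map -> shuffle -> reduce sort order.
 * When the key is a bean (a composite key), that is the first bean in sorted order.
 *
 * If reduce() iterates over Iterable<NullWritable> values, each step of the iterator
 * assigns the next record to OrderBean key. Every value is paired with its own key, and
 * the framework deserializes each key into the same object that reduce() received
 * (the object is reused), so advancing the iterator updates both key and value.
 * To test this, take a copy: OrderBean mykey = new OrderBean(key.getOrder_id(), key.getPrice());
 * mykey does not change as the iterator advances.
 * This behaviour makes it possible to take the TopN of sorted data.
 **/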
public class OrderReducer extends Reducer<OrderBean, NullWritable, OrderBean, NullWritable> {
    @Override
    protected void reduce(OrderBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
//        OrderBean mykey = new OrderBean(key.getOrder_id(),key.getPrice());
        // Output the top N records of the group
        int topN = 3;
        for (NullWritable value : values) {
            // Debug output: the identity hash is the same on every iteration,
            // which shows that a single key object is reused
            System.out.println(System.identityHashCode(key));
            context.write(key, NullWritable.get());
            topN--;
            if (topN <= 0) break;
        }
    }
}
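
The key reuse that the reducer depends on can be imitated in plain Java. The sketch below is only a simulation under stated assumptions: KeyReuseSketch, Key, values and firstN are hypothetical names, and the iterator is a stand-in that overwrites one shared key object on every step, which is the behaviour described above for the framework's value iterator.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class KeyReuseSketch {
    /** Hypothetical mutable key, standing in for OrderBean. */
    static final class Key {
        int orderId;
        double price;
    }

    /** Placeholder playing the role of NullWritable. */
    static final Object NULL_VALUE = new Object();

    /** Stand-in value iterator: every next() overwrites the shared key. */
    static Iterable<Object> values(Key shared, int orderId, double[] sortedPrices) {
        return () -> new Iterator<Object>() {
            private int index = 0;

            @Override
            public boolean hasNext() {
                return index < sortedPrices.length;
            }

            @Override
            public Object next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                shared.orderId = orderId;
                shared.price = sortedPrices[index++];
                return NULL_VALUE;
            }
        };
    }

    /** Reads the key inside the loop, as OrderReducer does, and stops after n records. */
    static List<Double> firstN(Key key, Iterable<Object> values, int n) {
        List<Double> prices = new ArrayList<>();
        for (Object ignored : values) {
            prices.add(key.price); // key now holds the current record
            if (prices.size() >= n) {
                break;
            }
        }
        return prices;
    }

    public static void main(String[] args) {
        Key key = new Key();
        double[] prices = {722.4, 522.8, 522.4, 322.4}; // already sorted descending
        // prints [722.4, 522.8, 522.4]
        System.out.println(firstN(key, values(key, 2, prices), 3));
    }
}
```

Note that the loop copies key.price into the list. Storing the key reference itself would leave every list entry pointing at the same object, which holds whichever record was read last.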

3.5 driver

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @Author: xu.dm
 * @Date: 2019/8/30 16:25
 * @Version: 1.0
 * @Description: Finds the highest-priced record of each order and takes the TopN.
 **/
public class OrderDriver {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: OrderDriver <input path> <output path>");
            System.exit(-1);
        }

        // 1 Get the configuration and create the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2 Set the jar by class
        job.setJarByClass(OrderDriver.class);

        // 3 Set the mapper and reducer classes
        job.setMapperClass(OrderMapper.class);
        job.setReducerClass(OrderReducer.class);

        // 4 Set the map output key and value types
        job.setMapOutputKeyClass(OrderBean.class);
        job.setMapOutputValueClass(NullWritable.class);

        // 5 Set the final output key and value types
        job.setOutputKeyClass(OrderBean.class);
        job.setOutputValueClass(NullWritable.class);

        // 6 Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 7 Set the reduce-side grouping comparator
        job.setGroupingComparatorClass(OrderSortGroupingComparator.class);

        // 8 Submit the job
        // In a test environment, the output directory can be deleted first
        Path outPath = new Path(args[1]);
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }

        long startTime = System.currentTimeMillis();
        boolean result = job.waitForCompletion(true);

        long endTime = System.currentTimeMillis();
        long timeSpan = endTime - startTime;
        System.out.println("Elapsed time: " + timeSpan + " ms.");

        System.exit(result ? 0 : 1);
    }
}

3.6 Output

order_id=1, price=322.8
order_id=1, price=222.8
order_id=1, price=122.8
order_id=2, price=722.4
order_id=2, price=522.8
order_id=2, price=522.4
order_id=3, price=232.8
order_id=3, price=133.8
order_id=3, price=33.8
