MapReduce grouping comparator (secondary sort) case

Requirements analysis:

Find the most expensive item in each order.

Custom data type OrderBean class:

1. Implement the WritableComparable interface.

2. Provide a no-argument constructor (the framework needs it to create instances during deserialization).

3. Implement serialization and deserialization; write() and readFields() must handle the fields in the same order.

4. Override the toString() method (it determines the format of the output file).

5. Override compareTo() to define the sort order.


import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class OrderBean implements WritableComparable<OrderBean> {
   int order_id;   // order id
   String item_id; // item id
   float price;    // price

    public OrderBean() {
        super();
    }


    @Override
    public String toString() {
        return order_id+"\t"+item_id+"\t"+price;
    }
    // Secondary sort: first by order id (ascending), then by price (descending)
    @Override
    public int compareTo(OrderBean o) {
        if (this.order_id == o.order_id)
            return -Float.compare(this.price, o.price); // higher price first
        else
            return Integer.compare(this.order_id, o.order_id);
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeInt(order_id);
        dataOutput.writeUTF(item_id); // serialize the String field
        dataOutput.writeFloat(price);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        // must read the fields in exactly the order write() wrote them
        this.order_id = dataInput.readInt();
        this.item_id = dataInput.readUTF(); // deserialize the String field
        this.price = dataInput.readFloat();
    }

    public int getOrder_id() {
        return order_id;
    }

    public void setOrder_id(int order_id) {
        this.order_id = order_id;
    }

    public String getItem_id() {
        return item_id;
    }

    public void setItem_id(String item_id) {
        this.item_id = item_id;
    }

    public float getPrice() {
        return price;
    }

    public void setPrice(float price) {
        this.price = price;
    }
}
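
To see the ordering that compareTo() above produces, here is a small stand-alone sketch; the three records and their values are made up for illustration:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class OrderBeanSortDemo {
    public static void main(String[] args) {
        List<OrderBean> beans = new ArrayList<>();
        // hypothetical records: (order_id, item_id, price)
        beans.add(newBean(2, "item_b", 33.8f));
        beans.add(newBean(1, "item_a", 222.8f));
        beans.add(newBean(1, "item_c", 33.0f));

        Collections.sort(beans); // uses OrderBean.compareTo()
        // prints order 1 first (price descending inside the order), then order 2:
        // 1    item_a  222.8
        // 1    item_c  33.0
        // 2    item_b  33.8
        beans.forEach(System.out::println);
    }

    private static OrderBean newBean(int orderId, String itemId, float price) {
        OrderBean b = new OrderBean();
        b.setOrder_id(orderId);
        b.setItem_id(itemId);
        b.setPrice(price);
        return b;
    }
}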

MyMapper class:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class MyMapper extends Mapper<LongWritable, Text,OrderBean, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // input line format: order_id \t item_id \t price
        String line = value.toString();
        String[] fields = line.split("\t");

        OrderBean k = new OrderBean();
        k.setOrder_id(Integer.parseInt(fields[0]));
        k.setItem_id(fields[1]);
        k.setPrice(Float.parseFloat(fields[2]));

        context.write(k, NullWritable.get());
        System.out.println(k); // debug output; remove for real runs
    }
}

MyReducer class:

Because of the grouping comparator, all records with the same order id fall into one group, and within each group they arrive sorted by price from high to low. The first record of each group is therefore exactly the one we want, so a single context.write(key, NullWritable.get()); outputs it and nothing else.

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MyReducer extends Reducer<OrderBean, NullWritable, OrderBean, NullWritable> {
    @Override
    protected void reduce(OrderBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // key is the first record of the group, i.e. the highest-priced item of the order
        context.write(key, NullWritable.get());
    }
}
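
Writing only the first key per group yields the single most expensive item. If the requirement were, say, the top two items per order, a minimal variant of reduce() (a sketch, same types as above) would iterate the values instead; note that Hadoop re-populates the key object on every iteration, so key refers to a different record each time:

    @Override
    protected void reduce(OrderBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        int n = 0;
        for (NullWritable v : values) {
            // key now holds the group's next record, still sorted by price descending
            context.write(key, NullWritable.get());
            if (++n == 2) break; // keep only the top 2 per order
        }
    }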

MyGroupingComparator class:

1. Inherit from WritableComparator.

2. Override the compare() method to group records by order id.

3. Provide a constructor that passes the key class (the class implementing WritableComparable) to the superclass: super(OrderBean.class, true); — the boolean true tells the parent class to create instances of the key for comparison.

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class MyGroupingComparator extends WritableComparator {
    public MyGroupingComparator() {
        super(OrderBean.class,true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        OrderBean aBean = (OrderBean) a;
        OrderBean bBean = (OrderBean) b;

        // two records fall into the same reduce group iff they share an order id
        return Integer.compare(aBean.getOrder_id(), bBean.getOrder_id());
    }
}
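
An optional refinement, not needed for this case: because order_id is the first field and writeInt() stores it as four big-endian bytes, the grouping decision can also be made directly on the serialized bytes, skipping deserialization during the shuffle. A sketch of that variant inside MyGroupingComparator:

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // order_id occupies the first 4 bytes of each serialized OrderBean
        int id1 = readInt(b1, s1); // static helper inherited from WritableComparator
        int id2 = readInt(b2, s2);
        return Integer.compare(id1, id2);
    }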

MyDriver class:

The key point in the driver is telling the job which class to use for grouping:

job.setGroupingComparatorClass(MyGroupingComparator.class);

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class MyDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(MyDriver.class);

        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        job.setMapOutputKeyClass(OrderBean.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setOutputKeyClass(OrderBean.class);
        job.setOutputValueClass(NullWritable.class);

        job.setGroupingComparatorClass(MyGroupingComparator.class);

        FileInputFormat.addInputPath(job, new Path("/home/hadoop/temp/order.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/home/hadoop/temp/order_RES"));

        // delete the output directory if it already exists, so the job can be rerun
        FileSystem.get(conf).delete(new Path("/home/hadoop/temp/order_RES"), true);


        boolean b = job.waitForCompletion(true);

        System.exit(b?0:1);
    }
}

Run result:

(screenshot of the job output in the original post)
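
To illustrate, suppose order.txt contained the following tab-separated lines (hypothetical data, three orders):

1	item_0001	222.8
2	item_0002	722.4
2	item_0003	522.8
1	item_0004	33.8
3	item_0005	232.8
3	item_0006	33.8

The job would then emit one line per order, each order's most expensive item:

1	item_0001	222.8
2	item_0002	722.4
3	item_0005	232.8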

Summary of problems encountered:

1. Reasons why reduce() appears to do nothing:

        a. There is no context.write(), or the driver class forgot to set the Reducer class.

        b. The (k, v) types output by the map stage do not match the (k, v) types expected by the reduce stage.

        c. In a fully distributed cluster, the nodes may be unable to reach each other, so the reduce task cannot pull the map task's output over the network.

        d. Serialization errors can also leave the reduce task unable to deserialize the data it pulled.

2. When serializing and deserializing a String field, use writeUTF() and readUTF() as a matched pair.
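
To make point 2 (and the field-order rule behind point 1d) concrete, here is a deliberately broken sketch:

    // BROKEN sketch: the field order differs between the two methods.
    public void write(DataOutput out) throws IOException {
        out.writeInt(order_id);
        out.writeUTF(item_id);
        out.writeFloat(price);
    }

    public void readFields(DataInput in) throws IOException {
        this.order_id = in.readInt();
        // WRONG: reads the UTF length header (and part of the string) as float
        // bytes, so everything after this point is corrupted; the job typically
        // fails with EOFException or UTFDataFormatException during the shuffle.
        this.price = in.readFloat();
        this.item_id = in.readUTF();
    }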

Origin: blog.csdn.net/qq_52135683/article/details/126682141