Custom data types as keys in MapReduce

    In the MapReduce programming model, keys are used for sorting and partitioning. Sorting orders the <k,v> key-value pairs by key, and partitioning assigns each <k,v> pair to a Reducer node based on the key's hashCode value.

    Key types in MapReduce must implement the WritableComparable interface. For convenience, Hadoop provides a number of built-in key types; common ones are IntWritable, LongWritable, Text, and FloatWritable. Sometimes, however, we need to use a data type of our own as the key.

    This article uses the example of computing the intersection of two data tables to show how to use a custom data type as a key.

There are two data tables whose records follow the same schema; the fields are id, name, age, and grade.

[Sample contents of table1 and table2]

    The goal of the intersection is to output the records that appear in both tables. In the sample data above, table1 and table2 share the records with id 1 and 2, so those two records should be output.

    The idea for finding the intersection is: in the Map stage, output <r, 1> for each record r; in the Reduce stage, sum the counts for each record and output every record r whose count is 2. (This assumes each record appears at most once per table.)
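Before bringing Hadoop into the picture, the counting idea can be sketched in plain Java. The sample rows below are assumptions for illustration, chosen so that, as in the article's tables, the records with id 1 and 2 appear in both inputs:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the MapReduce intersection idea:
// the "map" step emits (record, 1) for every line of both tables,
// the "reduce" step keeps the records whose total count is 2.
// Assumes each record appears at most once per table.
public class IntersectionSketch {
	public static List<String> intersect(List<String> table1, List<String> table2) {
		Map<String, Integer> counts = new LinkedHashMap<>();
		for (String r : table1) counts.merge(r, 1, Integer::sum); // map + shuffle
		for (String r : table2) counts.merge(r, 1, Integer::sum);
		List<String> result = new ArrayList<>();
		for (Map.Entry<String, Integer> e : counts.entrySet()) {
			if (e.getValue() == 2) result.add(e.getKey());        // reduce
		}
		return result;
	}

	public static void main(String[] args) {
		// hypothetical rows: id, name, age, grade (tab-separated)
		List<String> t1 = List.of("1\tTom\t18\t90", "2\tAnn\t19\t85", "3\tBob\t20\t70");
		List<String> t2 = List.of("1\tTom\t18\t90", "2\tAnn\t19\t85", "4\tEve\t18\t95");
		System.out.println(intersect(t1, t2)); // the records with id 1 and 2
	}
}
```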

    We use a Stu class to store a record and use Stu as the key type. Besides implementing the WritableComparable interface, the Stu class must pay attention to the following points:


  1. There must be a no-argument constructor.

  2. It must implement the write() and readFields() methods (serialization, from the Writable interface) and the compareTo() method (sorting, from the Comparable interface), and it should also override Object's hashCode() and equals() methods, since the default partitioner uses hashCode() and equal records must hash to the same Reducer.


The code for the Stu class is as follows:


package Eg1;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

/**
 * Stu class design
 * @author liuchen
 */
public class Stu implements WritableComparable<Stu>
{
	private int id;
	private String name;
	private int age;
	private int grade;
	// no-argument constructor is essential
	public Stu () {	
	}
	public Stu(int a, String b, int c, int d){
		this.id = a;
		this.name = b;
		this.age = c;
		this.grade = d;
	}
	public void readFields(DataInput in) throws IOException {
		this.id = in.readInt();
		this.name = in.readUTF();
		this.age = in.readInt();
		this.grade = in.readInt();	
	}
	public void write(DataOutput out) throws IOException {
		out.writeInt(id);
		out.writeUTF(name);
		out.writeInt(age);
		out.writeInt(grade);
	}
	// sort by id in descending order; return 0 only when every field matches,
	// because Hadoop groups reduce input by compareTo(), not by equals() --
	// a compareTo() that never returns 0 would break the intersection
	public int compareTo(Stu o) {
		if (this.id != o.id) {
			return this.id > o.id ? -1 : 1;
		}
		int c = this.name.compareTo(o.name);
		if (c != 0) {
			return c;
		}
		if (this.age != o.age) {
			return this.age - o.age;
		}
		return this.grade - o.grade;
	}
	// equal records must produce equal hash codes, so the default
	// partitioner sends them to the same Reducer
	public int hashCode() {
		return this.id + this.name.hashCode() + this.age + this.grade;
	}
	public boolean equals(Object obj) {
		if (!(obj instanceof Stu)) {
			return false;
		}
		Stu r = (Stu) obj;
		return this.id == r.id && this.name.equals(r.name)
				&& this.age == r.age && this.grade == r.grade;
	}
	
	public String toString() {
		return Integer.toString(id) + "\t" + name + "\t" + Integer.toString(age) + "\t" + Integer.toString(grade);
	}
}
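The write()/readFields() pair and the compareTo() contract can be checked without a Hadoop cluster. The sketch below uses a simplified copy of the class (WritableComparable itself requires the Hadoop jar) that mirrors the same method bodies and field order:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Plain-Java check of the Stu serialization and comparison logic.
public class StuCheck {
	static class SimpleStu {
		int id; String name; int age; int grade;
		SimpleStu() {}
		SimpleStu(int id, String name, int age, int grade) {
			this.id = id; this.name = name; this.age = age; this.grade = grade;
		}
		void write(DataOutput out) throws IOException {    // same field order as Stu.write()
			out.writeInt(id); out.writeUTF(name); out.writeInt(age); out.writeInt(grade);
		}
		void readFields(DataInput in) throws IOException { // must match write()'s order
			id = in.readInt(); name = in.readUTF(); age = in.readInt(); grade = in.readInt();
		}
		int compareTo(SimpleStu o) {                       // descending by id, 0 only on full match
			if (id != o.id) return id > o.id ? -1 : 1;
			int c = name.compareTo(o.name);
			if (c != 0) return c;
			if (age != o.age) return age - o.age;
			return grade - o.grade;
		}
	}

	// serialize a record, deserialize it, and verify it compares equal to the original
	public static boolean roundTripOk() {
		try {
			SimpleStu a = new SimpleStu(1, "Tom", 18, 90);
			ByteArrayOutputStream buf = new ByteArrayOutputStream();
			a.write(new DataOutputStream(buf));
			SimpleStu b = new SimpleStu();
			b.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
			return a.compareTo(b) == 0;
		} catch (IOException e) {
			return false;
		}
	}

	public static void main(String[] args) {
		System.out.println(roundTripOk());
	}
}
```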


The map() function of the Map stage is as follows:


protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
	final IntWritable one = new IntWritable(1);
	// each input line holds the fields id, name, age, grade, separated by tabs
	String[] arr = value.toString().split("\t");
	Stu stu = new Stu(Integer.parseInt(arr[0]), arr[1], Integer.parseInt(arr[2]), Integer.parseInt(arr[3]));
	context.write(stu, one);
}

The reduce() function of the Reduce phase is as follows:


protected void reduce(Stu key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
	int sum = 0;
	for (IntWritable val : values) {
		sum += val.get();
	}
	// a count of 2 means the record appears in both tables
	if (sum == 2) {
		context.write(key, NullWritable.get());
	}
}
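The article omits the driver that wires the job together. A sketch is below; the mapper/reducer class names (StuMapper, StuReducer) and the use of command-line arguments for the paths are assumptions, not from the original. The important detail for a custom key is the setMapOutputKeyClass(Stu.class) call:

```java
// Hypothetical driver; StuMapper and StuReducer are assumed names for the
// classes holding the map() and reduce() methods shown above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IntersectionDriver {
	public static void main(String[] args) throws Exception {
		Job job = Job.getInstance(new Configuration(), "table intersection");
		job.setJarByClass(IntersectionDriver.class);
		job.setMapperClass(StuMapper.class);
		job.setReducerClass(StuReducer.class);
		job.setMapOutputKeyClass(Stu.class);            // the custom key type
		job.setMapOutputValueClass(IntWritable.class);
		job.setOutputKeyClass(Stu.class);
		job.setOutputValueClass(NullWritable.class);
		FileInputFormat.addInputPath(job, new Path(args[0]));  // table1
		FileInputFormat.addInputPath(job, new Path(args[1]));  // table2
		FileOutputFormat.setOutputPath(job, new Path(args[2]));
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}
```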



