First, the requirements

1, description of the data file

Some temperature data files are stored on HDFS in text form. A row looks like the sample quoted in the mapper code:

1949-10-01 14:21:02    34℃

The date and time are separated by a space and together form the moment at which the monitoring station took a reading. The temperature follows, separated from the timestamp by a tab (\t).


2, requirements

  1. For the years 1949 to 1955, sort each year's temperatures in descending order and write each year's result to its own output file.

This requires a custom partitioner, a custom grouping comparator, and a custom sort comparator.

Second, the solution

1, approach

  1. Sort by year in ascending order, then by temperature in descending order within each year.
  2. Group by year, with each year handled by its own reduce task.
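
The two steps can be tried out in plain Java before involving Hadoop. The following is only a minimal sketch of the idea, not the job itself: the class name `IdeaSketch`, the `int[]{year, temp}` representation, and the sample readings are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch; it does not use Hadoop.
final class IdeaSketch {

    // Step 1: year ascending, then temperature descending within a year.
    static final Comparator<int[]> YEAR_ASC_TEMP_DESC =
            Comparator.<int[]>comparingInt(r -> r[0])
                    .thenComparing(Comparator.<int[]>comparingInt(r -> r[1]).reversed());

    // Step 2: collect the sorted readings into one list per year.
    static Map<Integer, List<Integer>> sortAndGroup(List<int[]> readings) {
        List<int[]> sorted = new ArrayList<>(readings);
        sorted.sort(YEAR_ASC_TEMP_DESC);
        Map<Integer, List<Integer>> byYear = new LinkedHashMap<>();
        for (int[] r : sorted) {
            byYear.computeIfAbsent(r[0], y -> new ArrayList<>()).add(r[1]);
        }
        return byYear;
    }

    public static void main(String[] args) {
        List<int[]> readings = Arrays.asList(
                new int[]{1950, 22}, new int[]{1949, 34},
                new int[]{1950, 41}, new int[]{1949, 36});
        // prints {1949=[36, 34], 1950=[41, 22]}
        System.out.println(sortAndGroup(readings));
    }
}
```

In the MapReduce job the same two responsibilities are split between a sort comparator and a grouping comparator, with a partitioner deciding which reduce task receives each year.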

2, the custom mapper output type KeyPair

Each row can be treated as one record with two parts: the time and the temperature.

The map output therefore needs a custom key type, and the sorting and grouping applied after the map phase have to be customized as well, because the defaults do not do what is needed here.

Defining KeyPair

Because the custom map output type is passed on to the reduce phase, it must implement Hadoop's WritableComparable interface, with KeyPair itself as the type parameter. This is the same pattern LongWritable follows (see the definition of LongWritable).

Implementing WritableComparable means overriding three methods, write / readFields / compareTo, which handle serialization / deserialization / comparison respectively.

toString and hashCode also need to be overridden, to avoid equals-related problems.

KeyPair is defined as follows.

Note that serialization (write) and deserialization (readFields) go through DataOutput and DataInput. By that point the time has already been converted from the standard time format shown in the file, so only the two int fields are written.


import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * Project   : hadooptest2
 * Package   : com.mapreducetest.temp
 * User      : Postbird @ http://www.ptbird.cn
 * TIME      : 2017-01-19 21:53
 */

/**
 * Wraps the year and the temperature in a single object.
 * year is the year and temp is the temperature.
 */
public class KeyPair implements WritableComparable<KeyPair>{
    // Year
    private int year;
    // Temperature
    private int temp;

    public void setYear(int year) {
        this.year = year;
    }

    public void setTemp(int temp) {
        this.temp = temp;
    }

    public int getYear() {
        return year;
    }

    public int getTemp() {
        return temp;
    }
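    @Override
    public int compareTo(KeyPair o) {
        // Compare the years first; a non-zero result means they differ
        int result=Integer.compare(year,o.getYear());
        if(result != 0){
            // The years differ, so the year alone decides the order
            return result;
        }
        // Same year: compare the temperatures
        return Integer.compare(temp,o.getTemp());
    }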

    // Serialization: write the fields in a fixed order
    @Override
    public void write(DataOutput dataOutput) throws IOException {
       dataOutput.writeInt(year);
       dataOutput.writeInt(temp);
    }

    // Deserialization: read the fields back in the same order
    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.year=dataInput.readInt();
        this.temp=dataInput.readInt();
    }

    @Override
    public String toString() {
        return year+"\t"+temp;
    }

    @Override
    public int hashCode() {
        // Same value as before, without the deprecated Integer constructor
        return Integer.hashCode(year+temp);
    }
}
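
The write / readFields pair can be checked without a cluster, because DataOutputStream and DataInputStream from the JDK implement DataOutput and DataInput. The sketch below assumes nothing beyond the JDK and mirrors KeyPair's field order; `RoundTripSketch` and its two static methods are hypothetical names, not part of the project.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-alone sketch; mirrors KeyPair.write / readFields without Hadoop.
final class RoundTripSketch {

    // Same field order as KeyPair.write: year first, then temp.
    static byte[] write(int year, int temp) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(year);
            out.writeInt(temp);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory stream
        }
        return bytes.toByteArray();
    }

    // Reads the fields back in the order they were written.
    static int[] read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int year = in.readInt();
            int temp = in.readInt();
            return new int[]{year, temp};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = write(1949, 34);
        int[] back = read(data);
        System.out.println(data.length + " bytes, year=" + back[0] + ", temp=" + back[1]);
    }
}
```

Two ints serialize to eight bytes. If readFields read the fields in a different order than write wrote them, the year and temperature would silently swap, which is why the order is worth testing.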

3, custom grouping

Readings from the same year must end up together, so the grouping compares years.

The comparison therefore looks only at the year. Note that the objects being compared are of type KeyPair, which is also the map output key type.

Because the class extends WritableComparator, it has to override the compare method. The arguments are KeyPair objects (KeyPair implements the WritableComparable interface), but only their years are compared; the same year yields 0.


/**
 * Project   : hadooptest2
 * Package   : com.mapreducetest.temp
 * User      : Postbird @ http://www.ptbird.cn
 * TIME      : 2017-01-19 22:08
 */

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 *  Groups the temperatures; comparing the year is enough
 */
public class GroupTemp extends WritableComparator{

    public GroupTemp() {
        super(KeyPair.class,true);
    }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Returns 0 when the years are the same
        KeyPair o1=(KeyPair)a;
        KeyPair o2=(KeyPair)b;
        return Integer.compare(o1.getYear(),o2.getYear());
    }
}
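
How a year-only comparison turns into groups can be modeled in a few lines. This is a simplified model rather than Hadoop's actual reduce-side code: it assumes the keys arrive already sorted and starts a new group whenever the comparator reports a difference from the previous key. `GroupingSketch` and the `int[]{year, temp}` keys are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-alone sketch; a simplified model of grouping, without Hadoop.
final class GroupingSketch {

    // Same rule as GroupTemp: only the year decides whether two keys belong together.
    static final Comparator<int[]> SAME_YEAR = Comparator.<int[]>comparingInt(k -> k[0]);

    // Walks keys that are already sorted and opens a new group whenever
    // the comparator says the current key differs from the previous one.
    static List<List<int[]>> group(List<int[]> sortedKeys) {
        List<List<int[]>> groups = new ArrayList<>();
        int[] previous = null;
        for (int[] key : sortedKeys) {
            if (previous == null || SAME_YEAR.compare(previous, key) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<int[]> sorted = Arrays.asList(
                new int[]{1949, 36}, new int[]{1949, 34}, new int[]{1950, 41});
        for (List<int[]> g : group(sorted)) {
            System.out.println(g.get(0)[0] + ": " + g.size() + " reading(s)");
        }
    }
}
```

Although the two 1949 keys carry different temperatures, they compare as equal here and so fall into one group, which corresponds to a single reduce call seeing all readings of that year.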

4, custom partition

The purpose of the custom partitioner is to route records to reduce tasks by year: records from the same year always go to the same reduce task, and different years are spread across different reduce tasks, so the partition number is computed from the year.


import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * Project   : hadooptest2
 * Package   : com.mapreducetest.temp
 * User      : Postbird @ http://www.ptbird.cn
 * TIME      : 2017-01-19 22:17
 */

// Custom partitioner
// Each year is handled by one reduce task
public class FirstPartition extends Partitioner<KeyPair,Text>{
    @Override
    public int getPartition(KeyPair key, Text value, int num) {
        // Partition by year: the same year always returns the same value
        return (key.getYear()*127)%num;
    }
}
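
The partition arithmetic is easy to check by hand or with a few lines of plain Java. The sketch below copies the formula from getPartition into a hypothetical helper, `PartitionSketch.partitionFor`, and prints the partition for three sample years with three reduce tasks.

```java
// Hypothetical stand-alone sketch; same arithmetic as FirstPartition.getPartition.
final class PartitionSketch {

    static int partitionFor(int year, int numReduceTasks) {
        return (year * 127) % numReduceTasks;
    }

    public static void main(String[] args) {
        // prints 1949 -> 2, 1950 -> 0, 1951 -> 1
        for (int year = 1949; year <= 1951; year++) {
            System.out.println(year + " -> " + partitionFor(year, 3));
        }
    }
}
```

With three reduce tasks, three consecutive years each get their own partition. Over the full 1949 to 1955 range the same formula with three tasks puts several years in one partition (1952 lands in partition 2 again, like 1949), whereas seven reduce tasks give each of the seven years a partition of its own.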

5, custom sorting

In the end the job comes down to sorting temperatures, so this part is the most important.

As required above, years are sorted in ascending order and temperatures in descending order, with the year compared first.


/**
 * Project   : hadooptest2
 * Package   : com.mapreducetest.temp
 * User      : Postbird @ http://www.ptbird.cn
 * TIME      : 2017-01-19 22:08
 */

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 *  Sort comparator for the temperatures
 */
public class SortTemp extends WritableComparator{

    public SortTemp() {
        super(KeyPair.class,true);
    }
    // Custom sort order
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Year ascending, temperature descending
        KeyPair o1=(KeyPair)a;
        KeyPair o2=(KeyPair)b;
        int result=Integer.compare(o1.getYear(),o2.getYear());
        // Compare the years; if they differ, the year decides
        if(result != 0){
            return result;
        }
        // Same year: sort the temperatures in descending order (note the minus sign)
        return -Integer.compare(o1.getTemp(),o2.getTemp());
    }
}

6, writing the MapReduce program

Several points worth noting:

    1. The time at the start of each row is a string, but KeyPair stores the year as an int, so the string has to be parsed into a date with SimpleDateFormat, using the format "yyyy-MM-dd HH:mm:ss".
    2. Each input row is split on the tab character "\t" to separate the time from the temperature. The year is taken from the parsed time, the number is obtained by stripping the "℃" symbol from the second part, and a KeyPair is then created and emitted.
    3. One reduce task per year comes from the custom partitioner, which partitions by year. The reducer simply writes the map output out again in order; with three reduce tasks, three output files are generated.
    4. Because custom sorting, grouping, and partitioning are used, the corresponding classes must be specified, and the number of reduce tasks has to be set as well.
    5. The client (driver) code is otherwise the usual boilerplate; the only difference is that the custom classes are specified.
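
The parsing described in points 1 and 2 can be exercised on its own. The following sketch follows the same steps (split on the tab, parse the time with SimpleDateFormat, strip "℃") outside of Hadoop; `ParseSketch.parse` and its return convention of `null` for malformed rows are assumptions for illustration, and "℃" is written as the escape `\u2103` so the source compiles regardless of file encoding.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;

// Hypothetical stand-alone sketch of the row parsing; it does not use Hadoop.
final class ParseSketch {

    // Returns {year, temperature} for one input row, or null if the row is malformed.
    static int[] parse(String line) {
        String[] parts = line.split("\t");
        if (parts.length != 2) {
            return null;
        }
        int degree = parts[1].indexOf('\u2103'); // position of the degree Celsius sign
        if (degree < 0) {
            return null;
        }
        try {
            // SimpleDateFormat is not thread-safe, so this sketch creates one per call.
            SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
            Calendar calendar = Calendar.getInstance();
            calendar.setTime(format.parse(parts[0]));
            int year = calendar.get(Calendar.YEAR);
            int temp = Integer.parseInt(parts[1].substring(0, degree).trim());
            return new int[]{year, temp};
        } catch (ParseException | NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        int[] parsed = parse("1949-10-01 14:21:02\t34\u2103");
        System.out.println(parsed[0] + "\t" + parsed[1]);
    }
}
```

Returning `null` for rows that do not match the expected shape is one possible choice; the mapper in this article instead catches the exception, prints the stack trace, and skips the row.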

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import java.io.IOException;
import java.net.URI;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

/**
 * Project   : hadooptest2
 * Package   : com.mapreducetest.temp
 * User      : Postbird @ http://www.ptbird.cn
 * TIME      : 2017-01-19 22:28
 */
public class RunTempJob {
    // Format used to parse the timestamp string into a Date
    public static SimpleDateFormat SDF=new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    /**
     * Mapper
     * The output key is the custom KeyPair
     */
    static class TempMapper extends Mapper<LongWritable,Text,KeyPair,Text>{
        @Override
        protected void map(LongWritable key,Text value,Context context) throws IOException,InterruptedException{
            String line=value.toString();
            // Sample row: 1949-10-01 14:21:02    34℃
            // Date and time are separated by a space; time and temperature by \t
            String[] ss=line.split("\t");
            if(ss.length==2){
                try{
                    // Parse the date
                    Date date=SDF.parse(ss[0]);
                    Calendar c=Calendar.getInstance();
                    c.setTime(date);
                    int year=c.get(Calendar.YEAR);// extract the year
                    // Take the temperature by cutting the string before ℃
                    String temp = ss[1].substring(0,ss[1].indexOf("℃"));
                    // Build the output key, of type KeyPair
                    KeyPair kp=new KeyPair();
                    kp.setYear(year);
                    kp.setTemp(Integer.parseInt(temp));
                    // Emit the key with the original row as the value
                    context.write(kp,value);
                }catch(Exception ex){
                    // Malformed rows are logged and skipped
                    ex.printStackTrace();
                }
            }
        }
    }
    /**
     *  Reducer
     *  Writes the map output through unchanged
     */
    static class TempReducer extends Reducer<KeyPair,Text,KeyPair,Text> {
        @Override
        protected void reduce(KeyPair kp, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            for (Text value:values){
                context.write(kp,value);
            }
        }
    }

    // Client (driver)
    public static void main(String[] args) throws IOException, InterruptedException{
        // Load the configuration
        Configuration conf=new Configuration();

        // Apply generic options from the command line; what remains is <in> <out>
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: temp <in> <out>");
            System.exit(2);
        }
        // Create the job (Job.getInstance replaces the deprecated constructor)
        Job job=Job.getInstance(conf,"temp");
        // 1. Set the class used to locate the job jar
        job.setJarByClass(RunTempJob.class);
        // 2. Set the mapper and reducer classes
        job.setMapperClass(RunTempJob.TempMapper.class);
        job.setReducerClass(RunTempJob.TempReducer.class);
        // 3. Set the map output key and value types
        job.setMapOutputKeyClass(KeyPair.class);
        job.setMapOutputValueClass(Text.class);
        // 4. Set the input and output directories
        FileInputFormat.addInputPath(job,new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job,new Path(otherArgs[1]));
        // 5. Set the number of reduce tasks, one per year
        job.setNumReduceTasks(3);// three years
        // 6. Set the partitioner, sort comparator, and grouping comparator classes
        job.setPartitionerClass(FirstPartition.class);
        job.setSortComparatorClass(SortTemp.class);
        job.setGroupingComparatorClass(GroupTemp.class);
        // 7. Submit the job, wait for it to finish, and print progress on the client
        boolean isSuccess= false;
        try {
            isSuccess = job.waitForCompletion(true);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
        // 8. Exit
        System.exit(isSuccess ?0:1);
    }
}

Third, the results

The three reduce tasks generate three output files on HDFS.


Each output file holds the temperature ranking for one year.


As can be seen, 1951 is the year from the map output key (the KeyPair) and 46 is its temperature; what follows is the original text of the row. Within each year, the output is sorted in descending order of temperature, as required.