Hadoop chained processing (ChainMapper / ChainReducer)

Scenario:

When we process text on a Hadoop cluster, chaining mappers lets us reuse an existing mapper while flexibly adding or removing pieces of business logic.
For the text below, we are only interested in the words or sentences that contain "房价" (housing price), we want to reuse the existing word-count mapper logic (word -> [word, 1]), and we also want to drop any word that occurs five times or fewer. All of this can be done in a single job; a short trace of how records flow through the chain follows the sample data.
石家庄房价真高啊
石家庄房价真高啊
石家庄房价真高啊
石家庄房价真高啊
石家庄房价还可以吧
石家庄房价还可以吧
石家庄房价还可以吧
石家庄房价还可以吧
石家庄房价太特么高了
石家庄房价太特么高了
石家庄房价太特么高了
石家庄房价太特么高了
石家庄房价太特么高了
石家庄房价太特么高了
石家庄房价太特么高了
石家庄房价太特么高了
北京房价便宜
北京房价便宜
北京房价便宜
北京房价便宜
北京房价便宜
北京房价便宜
北京房价便宜
北京房价便宜
北京房价便宜
hello tom
hello tom
hello tom
hello tom
hello tom
hello tom
hello tom
hello tom
hello tom2
hello tom2
hello tom2
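
Traced through the chain (using the counts in the sample data above), the stages defined below behave like this:

石家庄房价太特么高了 (8 lines): WcMapper1 emits (石家庄房价太特么高了, 1) once per line, WcMapper2 keeps the pairs because the key contains "房价", WcReducer sums them to (石家庄房价太特么高了, 8), and WcReducerMapper1 keeps the result because 8 > 5.
石家庄房价真高啊 (4 lines): survives the "房价" filter but is summed to only 4, so WcReducerMapper1 drops it.
hello, tom, tom2: split into separate tokens by WcMapper1 (those lines do contain whitespace) and dropped by WcMapper2 because they contain no "房价", so they never reach the reducer.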

Mapper 1: word -> [word, 1]

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.StringTokenizer;

/**
 * Word count mapper.
 * First map stage in the chain: tokenizes each line and emits (word, 1).
 */
public class WcMapper1 extends Mapper<LongWritable,Text,Text,IntWritable>{

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        Text keyOut = new Text();
        IntWritable valueOut = new IntWritable();
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while(tokenizer.hasMoreTokens()){
            keyOut.set(tokenizer.nextToken());
            valueOut.set(1);
            context.write(keyOut,valueOut);
        }
    }
}
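
A side note on tokenization: the default StringTokenizer splits on whitespace only, so each Chinese sample line (which contains no spaces) is emitted as a single "word", while a line like "hello tom" yields two tokens. A quick plain-Java check of that assumption (no Hadoop needed):

import java.util.StringTokenizer;

public class TokenizeCheck {
    public static void main(String[] args) {
        // Same tokenization as WcMapper1: the default StringTokenizer splits on whitespace.
        System.out.println(new StringTokenizer("石家庄房价太特么高了").countTokens()); // prints 1
        System.out.println(new StringTokenizer("hello tom").countTokens());           // prints 2
    }
}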

Mapper 2: keep only the words containing "房价"

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * Word count mapper.
 * Second map stage in the chain: keeps only the keys that contain "房价".
 */
public class WcMapper2 extends Mapper<Text,IntWritable,Text,IntWritable>{

    @Override
    protected void map(Text key, IntWritable value, Context context) throws IOException, InterruptedException {
        // forward only the keys that contain "房价"; everything else is dropped here
        if(key.toString().contains("房价")){
            context.write(key,value);
        }
    }
}

Reducer: sum the counts for each word

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.Iterator;

/**
 * Word count reducer: sums the counts for each key.
 */
public class WcReducer extends Reducer<Text,IntWritable,Text,IntWritable>{
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        Iterator<IntWritable> iterator = values.iterator();
        int count = 0;
        while (iterator.hasNext()){
            count += iterator.next().get();
        }
        context.write(key,new IntWritable(count));
    }
}

Post-reduce mapper: drop words with a count of 5 or fewer

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * Mapper chained after the reducer: only keys with a count greater than 5 reach the final output.
 */
public class WcReducerMapper1 extends Mapper<Text,IntWritable, Text, IntWritable> {
    @Override
    protected void map(Text key, IntWritable value, Context context) throws IOException, InterruptedException {
        if(value.get() > 5){
            context.write(key,value);
        }
    }
}

The driver (main class)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WcApp {

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        conf.set("fs.defaultFS","file:///");
        Job job = Job.getInstance(conf);

        //set the job properties
        job.setJobName("WcChainApp");                   //job name
        job.setJarByClass(WcApp.class);                 //class used to locate the jar
        job.setInputFormatClass(TextInputFormat.class); //input format

        //add the input path
        FileInputFormat.addInputPath(job,new Path("/home/hadoop/chain/fangjia.txt"));
        //set the output path (must not already exist when the job runs)
        FileOutputFormat.setOutputPath(job,new Path("/home/hadoop/chain/out"));

        //add WcMapper1 to the mapper chain
        ChainMapper.addMapper(job,WcMapper1.class, LongWritable.class,Text.class,Text.class,IntWritable.class, conf);
        //add WcMapper2 to the mapper chain
        ChainMapper.addMapper(job,WcMapper2.class, Text.class, IntWritable.class,Text.class,IntWritable.class, conf);

        //set the reducer of the reduce chain
        ChainReducer.setReducer(job,WcReducer.class,Text.class,IntWritable.class,Text.class,IntWritable.class,conf);
        //append WcReducerMapper1 after the reducer in the chain
        ChainReducer.addMapper(job,WcReducerMapper1.class, Text.class, IntWritable.class,Text.class,IntWritable.class, conf);

        job.setNumReduceTasks(3);                       //number of reduce tasks
        job.waitForCompletion(true);
    }
}
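
The driver above passes the job's own conf as every stage's configuration. The ChainMapper Javadoc quoted further below suggests giving each stage its own Configuration(false) instead. A minimal sketch of that variant, intended to replace the chain wiring inside WcApp.main (the property names wc.keyword and wc.minCount are hypothetical; WcMapper2 and WcReducerMapper1 would need to read them via context.getConfiguration()):

        //per-stage configurations, as suggested by the ChainMapper Javadoc (hypothetical properties)
        Configuration filterConf = new Configuration(false);
        filterConf.set("wc.keyword", "房价");            //hypothetical: keyword WcMapper2 could filter on

        Configuration thresholdConf = new Configuration(false);
        thresholdConf.setInt("wc.minCount", 5);          //hypothetical: threshold WcReducerMapper1 could use

        ChainMapper.addMapper(job, WcMapper1.class, LongWritable.class, Text.class,
                Text.class, IntWritable.class, new Configuration(false));
        ChainMapper.addMapper(job, WcMapper2.class, Text.class, IntWritable.class,
                Text.class, IntWritable.class, filterConf);
        ChainReducer.setReducer(job, WcReducer.class, Text.class, IntWritable.class,
                Text.class, IntWritable.class, new Configuration(false));
        ChainReducer.addMapper(job, WcReducerMapper1.class, Text.class, IntWritable.class,
                Text.class, IntWritable.class, thresholdConf);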

Output

With job.setNumReduceTasks(3) the output is split across part-r-00000 to part-r-00002; only the two keys that contain "房价" and occur more than five times survive both filters, and they happen to land in these two files:

//cat part-r-00001
石家庄房价太特么高了	8
//cat part-r-00002
北京房价便宜	9
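
Which part file a surviving key ends up in is decided by the job's default HashPartitioner (hash of the key modulo the number of reduce tasks). If you want to check the assignment locally, a small sketch like the following would print it (requires the Hadoop client jars on the classpath; the mapping is not verified here):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class PartitionCheck {
    public static void main(String[] args) {
        //default partitioner used by the job: (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
        HashPartitioner<Text, IntWritable> partitioner = new HashPartitioner<>();
        for (String key : new String[]{"石家庄房价太特么高了", "北京房价便宜"}) {
            int partition = partitioner.getPartition(new Text(key), new IntWritable(0), 3);
            System.out.println(key + " -> part-r-0000" + partition);
        }
    }
}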

Usage pattern

Open the source of ChainMapper; its official Javadoc contains the following (the // lines are annotations, not part of the Javadoc):

<p>
 * Using the ChainMapper and the ChainReducer classes is possible to compose
//the regex-like notation in the next line means: one or more mappers, then a reducer followed by zero or more mappers
 * Map/Reduce jobs that look like <code>[MAP+ / REDUCE MAP*]</code>. And
 * immediate benefit of this pattern is a dramatic reduction in disk IO.
 * </p>
 * <p>
 * IMPORTANT: There is no need to specify the output key/value classes for the
 * ChainMapper, this is done by the addMapper for the last mapper in the chain.
 * </p>
 * ChainMapper usage pattern:
 * <p>
 * 
 * <pre>
 * ...
 * Job = new Job(conf);
 *
 * Configuration mapAConf = new Configuration(false);
 * ...
 * ChainMapper.addMapper(job, AMap.class, LongWritable.class, Text.class,
 *   Text.class, Text.class, true, mapAConf);
 *
 * Configuration mapBConf = new Configuration(false);
 * ...
//the second mapper's input is the first mapper's output, and each mapper can have its own Configuration (note: the extra boolean by-value flag in this example is not present in the mapreduce-API addMapper used in WcApp above)
 * ChainMapper.addMapper(job, BMap.class, Text.class, Text.class,
 *   LongWritable.class, Text.class, false, mapBConf);
 *
 * ...
 *
 * job.waitForComplettion(true);
 * ...
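
Mapped onto this job, the [MAP+ / REDUCE MAP*] pattern is [WcMapper1, WcMapper2 / WcReducer, WcReducerMapper1]: two chained mappers feed one reducer, which is followed by one more mapper, all inside a single MapReduce job, which is where the disk-IO savings the Javadoc mentions come from.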


Reposted from blog.csdn.net/fantasticqiang/article/details/80639859