Two, MapReduce programming specifications in detail

[TOC]

One, MapReduce programming basic components

Writing a MapReduce program requires at least three essential parts: a mapper, a reducer, and a driver. Optionally, there may also be a partitioner and a combiner.
The input and output of the mapper and the input and output of the reducer are all key-value pairs, so when writing the mapper and reducer we must be clear about these four key-value pairs (eight data types in total), and every one of them must be a Hadoop serializable type. Also note that the output of the map is actually the input of the reduce, so the data types they share must be the same.
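For reference (a few common correspondences, not an exhaustive list from the original post): LongWritable ↔ long, IntWritable ↔ int, Text ↔ String, DoubleWritable ↔ double, and NullWritable when a key or value is not needed.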

1, map stage

The basic steps for writing it are:
1) Define a custom map class; this class needs to extend Mapper.
2) When extending Mapper, specify the key-value types of the input and output.
3) Override the map() method inherited from the parent class.
4) The overridden map() method is called once for each input key-value pair of each map task.

A basic example is as follows:

/*
The four type parameters of Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> are:
LongWritable, Text, Text, IntWritable, corresponding to the plain Java types:
long, String, String, int
*/
public class TestMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // the map processing logic goes here
    }
}

2, reduce stage

The steps for writing it are:
1) Define a custom reduce class; this class needs to extend Reducer.
2) When extending Reducer, specify the key-value types of the input and output.
3) Override the reduce() method inherited from the parent class.
4) The overridden reduce() method is called once for each input key (with its group of values) of each reduce task.

A basic example is as follows:

/*
The four type parameters of Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> are:
Text, IntWritable, Text, IntWritable, corresponding to the plain Java types:
String, int, String, int
*/
public class TestReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text key,
                          Iterable<IntWritable> values,
                          Context context) throws IOException, InterruptedException {
        // the reduce processing logic goes here
    }
}

3, driver stage

This part configures the job object with all of the required job configuration information; once the configuration is complete, the job is submitted to YARN for execution. The driver mainly plays the role of scheduling the execution of the map and reduce tasks. Have a look at the concrete configuration example below.
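A minimal driver sketch follows (my own outline, wiring together the TestMapper and TestReducer from the sections above; the complete WordCount driver later in this post shows the full set of options):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TestDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // the classes that make up the job
        job.setJarByClass(TestDriver.class);
        job.setMapperClass(TestMapper.class);
        job.setReducerClass(TestReducer.class);

        // map output types and final (reduce) output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // where to read the input and where to write the result
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // submit to YARN and wait for the job to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}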

4, partitioner stage

This stage partitions the output of the map stage. The number of partitions directly determines the number of reduce tasks (generally one-to-one). The steps are as follows:
1) Define a custom partitioner class that extends Partitioner<key, value>.
2) When extending Partitioner, specify the key-value types it processes (the map output types).
3) Override the getPartition() method inherited from the parent class.
4) The overridden getPartition() method is called once for each key-value pair output by each map task.
5) The partitioning rule returns a value of 0 ~ n, meaning the pair goes to partition 0 ~ n.

An example is as follows:

public class WordCountPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text text, IntWritable intWritable, int numPartitions) {
        // if condition 1 holds:
        //     return 0;
        // if condition 2 holds:
        //     return 1;
        // .......
        //     return n;
        return 0;
    }
}
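As a concrete illustration (my own sketch, not from the original post): a partitioner for the WordCount output that sends words starting with a-m to partition 0 and all other words to partition 1. It assumes the job is configured with job.setNumReduceTasks(2) so that both partitions have a reduce task.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString().toLowerCase();
        // words whose first character sorts at or before 'm' go to partition 0, the rest to partition 1
        int partition = (!word.isEmpty() && word.charAt(0) <= 'm') ? 0 : 1;
        // guard in case the job is run with fewer reduce tasks than expected
        return partition % numPartitions;
    }
}

It would then be registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class) together with job.setNumReduceTasks(2).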

5、combiner

The combiner is not a separate stage; it is actually part of the map stage. In the map output, every key-value pair has a value of 1, and identical keys still produce separate pairs. The more duplicate keys there are, the larger the map output that must be shipped to the reduce side, and the more bandwidth it consumes. The optimization is to first merge the output of each map task locally, within the current map task, which reduces the duplicate keys. That is:

<king,1>, <king,1>  pairs with the same key like these are merged into <king,2>,
which reduces the amount of data transferred.

From this you can see that the combiner actually performs the same operation as the reducer; the difference is that one is local and the other is global. Using it is simple: pass the reducer class to the job directly as the combiner, for example:

job.setCombinerClass(WordCountReducer.class);

We can look at the source of this method:

public void setCombinerClass(Class<? extends Reducer> cls) throws IllegalStateException {
    this.ensureState(Job.JobState.DEFINE);
    // note the Reducer.class here
    this.conf.setClass("mapreduce.job.combine.class", cls, Reducer.class);
}

You can clearly see that when the combine class is set, it is registered with Reducer as its base type, which further confirms that the combiner and the reducer perform the same kind of operation.
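If the combine logic needs to differ from the reduce logic, the combiner can also be written as its own Reducer subclass. A minimal sketch (my own example, not from the original post):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// standalone combiner: locally sums the counts for each word on the map side;
// its input and output types must both match the map output types (Text, IntWritable)
public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable sum = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable value : values) {
            total += value.get();
        }
        sum.set(total);
        context.write(key, sum);   // e.g. <king,1>,<king,1> becomes <king,2> locally
    }
}

It is registered the same way: job.setCombinerClass(WordCountCombiner.class). Note that a combiner is only safe when the operation is commutative and associative (like summing), because Hadoop may run it zero, one, or several times.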

Two, WordCount programming example

Let's use WordCount as an example to write a complete MapReduce program.

1、mapper

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // the setup and cleanup methods are not required
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // executed first, before any map calls
        //System.out.println("this is setup");
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // executed after all map calls have finished
        //System.out.println("this is cleanup");
    }

    // reusable objects created here to hold the intermediate key and value
    Text k = new Text();
    IntWritable v = new IntWritable();

    /**
     *
     *
     * @param key
     * @param value
     * @param context  the context connecting map and reduce; the map results are passed to reduce through this object
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        //System.out.println("map starts=====================");

        //1. value is one line of input; convert it to a Java String for processing, i.e. deserialize it
        String line = value.toString();

        //2. split the data
        String[] words = line.split(" ");

        //3. emit the map output in the form <word, count>; plain types must be converted to serializable types before writing
        /**
         * Two ways to write this:
         * 1) context.write(new Text(word), new IntWritable(1));
         *     Drawback: two new objects are created on every call, which ends up creating many temporary objects
         *
         * 2) Text k = new Text();
         *    IntWritable v = new IntWritable();
         *
         *    for {
         *       k.set(word);
         *       v.set(1);
         *       context.write(k, v);
         *    }
         *
         *    The advantage of this approach is that the objects are created only once and are reused by
         *    updating their internal values, so there is no need to repeatedly create new objects
         */
        for (String word:words) {
            // convert the plain type to a serializable type
            k.set(word);
            v.set(1);
            // write to the context object
            context.write(k, v);
        }
    }
}

2、reducer

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    /**
     * The Iterable<IntWritable> values parameter is an iterable object because the data
     * passed over from map has been grouped by key, e.g.
     * (HDFS,1),(HDFS,1) is merged into the form (HDFS,[1,1]), so the values can be read by iterating over it
     *
     */
    IntWritable counts = new IntWritable();

    @Override
    protected void reduce(Text key,
                          Iterable<IntWritable> values,
                          Context context) throws IOException, InterruptedException {
        //1. initialize the count
        int count = 0;

        //2. sum up the counts for the same key
        for (IntWritable value: values) {
            count += value.get();
        }

        //3. write the reduce output
        counts.set(count);
        context.write(key, counts);
    }
}

3、driver

public class WordCountDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // only for convenience when running directly in the IDE; on the command line simply pass the input and output paths as arguments
        args = new String[]{"G:\\test2\\", "G:\\testmap6\\"};

        //1. get the configuration object
        Configuration conf = new Configuration();

        //2. get the job object
        Job job = Job.getInstance(conf);

        //3. set the driver, mapper and reducer classes for the job
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        //4. set the output types of the map stage and of the reduce stage
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // a partitioner class can be set here; it requires writing an extra partitioner implementation class
//        job.setPartitionerClass(WordCountPartitioner.class);
//        job.setNumReduceTasks(2);

        // set the pre-aggregation (combiner) class
        //job.setCombinerClass(WordCountReducer.class);

        // set the InputFormat class; an optimization for large numbers of small files. If not set, TextInputFormat is used by default
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 3 * 1024 * 1024);
        CombineTextInputFormat.setMinInputSplitSize(job, 2 * 1024 * 1024);

        //5. the input data source and the output location of the results
        // the input is automatically split into map splits based on the data source, producing split information (also called a split plan)
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // with the job configuration complete, the job is submitted below; Hadoop will execute it according to this configuration

        //6. submit the job; this method is equivalent to calling job.submit() and then waiting for completion
        // the job configuration is submitted to the MRAppMaster on YARN
        job.waitForCompletion(true);

    }
}
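To run the example on a real cluster instead of from the IDE (not covered in the original post), the usual approach is to remove the hard-coded args line, package the classes into a jar, and submit it with the standard hadoop jar command, for example: hadoop jar wordcount.jar WordCountDriver /input/path /output/path (the jar name and paths here are placeholders). Note that FileOutputFormat requires the output directory not to exist yet; otherwise the job fails immediately.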

Origin blog.51cto.com/kinglab/2445038