MapReduce Programming Model, Part 4: An Introduction to Reducer

Overview

A Reducer reduces a set of intermediate values that share a key to a smaller set of values.

Reducer implementations can access the job's Configuration object through the JobContext.getConfiguration() method.

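A Reducer has three primary phases:

1. Shuffle

The Reducer copies the sorted intermediate output from each Mapper over HTTP.

2. Sort

The MapReduce framework merge-sorts the Reducer's inputs by key, since different Mappers may have emitted the same key.

The shuffle and sort phases occur simultaneously: the intermediate outputs are merged as they are fetched.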
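To make the combined shuffle-and-sort behaviour concrete, here is a minimal plain-Java sketch. It does not use any Hadoop classes; ShuffleSortSketch, Pair, and mergeByKey are hypothetical names chosen for illustration, and a TreeMap stands in for the framework's real merge machinery. It only models the observable result: values that share a key end up together, and keys come out in sorted order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Plain-Java model of shuffle + sort; an assumption-laden sketch, not Hadoop code.
class ShuffleSortSketch {

    // Hypothetical stand-in for one intermediate (key, value) record.
    static final class Pair {
        final String key;
        final int value;

        Pair(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    // Merges the outputs of several mappers into one key-ordered view,
    // collecting every value that shares a key into the same list.
    static TreeMap<String, List<Integer>> mergeByKey(List<List<Pair>> mapperOutputs) {
        TreeMap<String, List<Integer>> merged = new TreeMap<>();
        for (List<Pair> output : mapperOutputs) {
            // Each mapper's output is merged as soon as it is "fetched".
            for (Pair pair : output) {
                merged.computeIfAbsent(pair.key, k -> new ArrayList<>()).add(pair.value);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Pair> mapper1 = Arrays.asList(new Pair("apple", 1), new Pair("cherry", 1));
        List<Pair> mapper2 = Arrays.asList(new Pair("apple", 1), new Pair("banana", 1));
        // prints {apple=[1, 1], banana=[1], cherry=[1]}
        System.out.println(mergeByKey(Arrays.asList(mapper1, mapper2)));
    }
}
```

Each entry of the resulting map corresponds to one call to reduce: the key, plus the collection of values gathered for it.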

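Secondary Sort

To apply a secondary sort to the values returned by the value iterator, the application should extend the key with a secondary key and define a grouping comparator. Keys are sorted using the entire key but grouped using the grouping comparator, which decides which keys and values are sent to the same call to reduce. The grouping comparator is specified with Job.setGroupingComparatorClass(Class); the sort order is controlled by Job.setSortComparatorClass(Class).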
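The interplay between the two comparators can be sketched in plain Java without Hadoop. Everything in the block below is hypothetical and for illustration only: CompositeKey pairs a natural key (station) with a secondary key (year), SORT plays the role of the sort comparator, GROUP plays the role of the grouping comparator, and reduceCalls lists what each simulated call to reduce would see. For brevity the secondary key doubles as the value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java model of secondary sort; an illustrative sketch, not Hadoop code.
class SecondarySortSketch {

    // Hypothetical composite key: natural key (station) extended with a secondary key (year).
    static final class CompositeKey {
        final String station;
        final int year;

        CompositeKey(String station, int year) {
            this.station = station;
            this.year = year;
        }
    }

    // Sort comparator: orders records by the entire composite key.
    static final Comparator<CompositeKey> SORT =
            Comparator.comparing((CompositeKey k) -> k.station)
                      .thenComparingInt(k -> k.year);

    // Grouping comparator: compares the natural key only.
    static final Comparator<CompositeKey> GROUP =
            Comparator.comparing((CompositeKey k) -> k.station);

    // Returns one string per simulated reduce call: "station=[years in sorted order]".
    static List<String> reduceCalls(List<CompositeKey> keys) {
        List<CompositeKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);

        List<String> calls = new ArrayList<>();
        CompositeKey first = null;          // first key of the current group
        List<Integer> years = new ArrayList<>();
        for (CompositeKey key : sorted) {
            // A new group starts when the grouping comparator sees a different key.
            if (first != null && GROUP.compare(first, key) != 0) {
                calls.add(first.station + "=" + years);
                years = new ArrayList<>();
                first = null;
            }
            if (first == null) {
                first = key;
            }
            years.add(key.year);
        }
        if (first != null) {
            calls.add(first.station + "=" + years);
        }
        return calls;
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = Arrays.asList(
                new CompositeKey("s2", 2001), new CompositeKey("s1", 2003),
                new CompositeKey("s1", 2001), new CompositeKey("s2", 1999));
        // prints [s1=[2001, 2003], s2=[1999, 2001]]
        System.out.println(reduceCalls(keys));
    }
}
```

Because sorting uses the whole key while grouping uses only the station, each group arrives with its years already in ascending order. In a real job the equivalent comparators would be registered through Job.setSortComparatorClass(Class) and Job.setGroupingComparatorClass(Class).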

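3. Reduce

In this phase, the reduce(Object, Iterable, org.apache.hadoop.mapreduce.Reducer.Context) method is called once for each key (with its collection of values) in the sorted input. The output of the reduce task is typically written to a RecordWriter by calling TaskInputOutputContext.write(Object, Object).

The output of the Reducer is not sorted again.

Here is an example Reducer implementation: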

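import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Sums all IntWritable values that share a key and emits (key, sum).
public class IntSumReducer<Key> extends Reducer<Key,IntWritable,
                                                 Key,IntWritable> {
   // Reused for every key so that no new object is allocated per call
   private IntWritable result = new IntWritable();

   @Override
   public void reduce(Key key, Iterable<IntWritable> values,
                      Context context) throws IOException, InterruptedException {
     int sum = 0;
     for (IntWritable val : values) {
       sum += val.get();
     }
     result.set(sum);
     context.write(key, result);
   }
}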

The Reducer Class

The Reducer class is defined as follows:

public class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {

  public abstract class Context 
    implements ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
  }
  
  protected void setup(Context context
                       ) throws IOException, InterruptedException {
    // NOTHING
  }

  @SuppressWarnings("unchecked")
  protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context
                        ) throws IOException, InterruptedException {
    for(VALUEIN value: values) {
      context.write((KEYOUT) key, (VALUEOUT) value);
    }
  }
  
  protected void cleanup(Context context
                         ) throws IOException, InterruptedException {
    // NOTHING
  }

  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKey()) {
        reduce(context.getCurrentKey(), context.getValues(), context);
        // If a back up store is used, reset it
        Iterator<VALUEIN> iter = context.getValues().iterator();
        if(iter instanceof ReduceContext.ValueIterator) {
          ((ReduceContext.ValueIterator<VALUEIN>)iter).resetBackupStore();        
        }
      }
    } finally {
      cleanup(context);
    }
  }
}

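Like Mapper, Reducer has four methods: setup, reduce, cleanup, and run, and they play similar roles. The setup method is called once when the reduce task starts, and cleanup is called once after the reduce task has finished. The reduce method processes each key; it is the core of the Reducer class, and most applications implement their Reducer logic by overriding it. The run method ties the other three together and so controls how the reduce task proceeds. When a Reducer is executed, it is the run method that gets called.

Note that setup and cleanup are empty by default. If an application has special requirements, it can override these two methods to meet them.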
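The calling order described above can be illustrated with a small plain-Java sketch. MiniReducer is a hypothetical, heavily simplified stand-in for Reducer (no Context, no Hadoop types, input passed in as an already grouped map), and SummingReducer is an equally hypothetical subclass that overrides setup and cleanup just to record when they run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the Reducer lifecycle; a sketch, not Hadoop code.
class LifecycleSketch {

    // Hypothetical simplified stand-in for Reducer with the same four methods.
    static class MiniReducer {
        final List<String> log = new ArrayList<>();

        protected void setup() {
            // empty by default
        }

        protected void reduce(String key, List<Integer> values) {
            log.add("reduce " + key);
        }

        protected void cleanup() {
            // empty by default
        }

        // Same shape as Reducer.run: setup once, reduce per key, cleanup in finally.
        public void run(Map<String, List<Integer>> groupedInput) {
            setup();
            try {
                for (Map.Entry<String, List<Integer>> entry : groupedInput.entrySet()) {
                    reduce(entry.getKey(), entry.getValue());
                }
            } finally {
                cleanup();
            }
        }
    }

    // Overrides all three hooks; run is inherited unchanged.
    static class SummingReducer extends MiniReducer {
        @Override
        protected void setup() {
            log.add("setup");
        }

        @Override
        protected void reduce(String key, List<Integer> values) {
            int sum = 0;
            for (int value : values) {
                sum += value;
            }
            log.add(key + "=" + sum);
        }

        @Override
        protected void cleanup() {
            log.add("cleanup");
        }
    }

    // Runs the summing reducer over a tiny, already grouped input.
    static List<String> demo() {
        Map<String, List<Integer>> input = new LinkedHashMap<>();
        input.put("a", Arrays.asList(1, 2));
        input.put("b", Arrays.asList(3));
        SummingReducer reducer = new SummingReducer();
        reducer.run(input);
        return reducer.log;
    }

    public static void main(String[] args) {
        // prints [setup, a=3, b=3, cleanup]
        System.out.println(demo());
    }
}
```

Because cleanup sits in a finally block, it still runs if reduce throws, which makes it a reasonable place to release resources acquired in setup.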

References:

http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/Reducer.html

https://blog.csdn.net/gamer_gyt/article/details/47338053
