7.MR Core: Mapper and Reducer

1.Mapper

In the previous section we saw FileInputFormat split the input data into InputSplits and supply a RecordReader to the Mapper. Let's look at how the Mapper processes them:

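/**
 * Mappers are the individual tasks that transform input records into intermediate records.
 * The transformed intermediate records need not be of the same type as the input records.
 * A given input <k,v> pair may map to zero or many output <k,v> pairs.
 *
 * The MR framework spawns one Mapper task for each InputSplit generated by the job's InputFormat.
 * A Mapper can access the Job's configuration via JobContext.getConfiguration().
 * The framework first calls setup(...) exactly once, then map(...) for each <k,v> in the InputSplit,
 * and finally cleanup(...).
 *
 * All intermediate values associated with a given key are subsequently grouped by the framework
 * and passed to a Reducer to determine the final output.
 * Users can control the sorting and the grouping of keys by specifying two RawComparator classes.
 *
 * The Mapper outputs are partitioned per Reducer. Users can control which keys go to which Reducer
 * by implementing a custom Partitioner.
 *
 * Users can optionally specify a combiner via Job.setCombinerClass(...) to perform local aggregation,
 * which helps reduce the amount of data transferred from the Mapper to the Reducer.
 *
 * Applications can specify whether and how the intermediate Mapper output is compressed,
 * and which CompressionCodecs are used.
 *
 * If the Job has no Reducer, the Mapper output is written directly to the OutputFormat,
 * without sorting by key.
 */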
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  // The Context passed on to the Mapper implementations
  public abstract class Context
    implements MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
  }
  
  /**
   * Called once at the beginning of the task.
   */
  protected void setup(Context context
                       ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
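   * Called once for each <k,v> pair in the InputSplit. Most applications override this method;
   * the default implementation is the identity function.
   */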
  @SuppressWarnings("unchecked")
  protected void map(KEYIN key, VALUEIN value, 
                     Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }

  /**
   * Called once at the end of the task.
   */
  protected void cleanup(Context context
                         ) throws IOException, InterruptedException {
    // NOTHING
  }
  
  /**
   * Expert users can override this method for more complete control over the execution of the Mapper.
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }
}
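
The setup/map/cleanup sequence in run(...) is a template method, and it can be tried out without a Hadoop cluster. The following is only a minimal sketch in plain Java: MiniMapper, UpperCaseMapper and MiniMapperDemo are hypothetical names invented for illustration, and a plain Iterator stands in for the Context, so none of this is Hadoop's actual API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for Mapper: same setup/map/cleanup template, no Hadoop dependency.
class MiniMapper {
    protected final List<String> output = new ArrayList<>();

    // Called once before the first record.
    protected void setup() { }

    // Called once per record; the default is the identity function.
    protected void map(String key, String value) {
        output.add(key + "\t" + value);
    }

    // Called once after the last record, even if map() throws.
    protected void cleanup() { }

    // Mirrors the shape of Mapper#run: setup, loop over records, cleanup in finally.
    public final List<String> run(Iterator<String[]> records) {
        setup();
        try {
            while (records.hasNext()) {
                String[] kv = records.next();
                map(kv[0], kv[1]);
            }
        } finally {
            cleanup();
        }
        return output;
    }
}

// A subclass overriding all three hooks so the call order becomes visible in the output.
class UpperCaseMapper extends MiniMapper {
    @Override protected void setup() { output.add("setup"); }
    @Override protected void map(String key, String value) {
        output.add(key + "\t" + value.toUpperCase());
    }
    @Override protected void cleanup() { output.add("cleanup"); }
}

class MiniMapperDemo {
    static List<String> demo() {
        List<String[]> records = new ArrayList<>();
        records.add(new String[] {"k1", "a"});
        records.add(new String[] {"k2", "b"});
        return new UpperCaseMapper().run(records.iterator());
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The output list starts with "setup", ends with "cleanup", and holds one entry per input record in between, which is the ordering the class comment describes.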

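Having translated the Mapper class comment, it turns out that the Mapper flow we pieced together from various learning materials is already spelled out clearly in the source itself. In run(...), each call to map(...) receives one key/value pair: <context.getCurrentKey(), context.getCurrentValue()>. That pair is supplied by the RecordReader of a FileInputFormat subclass, because FileInputFormat itself does not implement createRecordReader(...). So let's look at one such subclass, KeyValueTextInputFormat.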

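/**
 * An InputFormat for plain text files. Files are broken into lines; either a line feed or a
 * carriage return signals the end of a line. Each line is divided into a key and a value by a separator byte.
 * If no such byte exists, the key is the entire line and the value is empty.
 * The separator can be set in the configuration under
 * mapreduce.input.keyvaluelinerecordreader.key.value.separator; the default is the tab character.
 */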
public class KeyValueTextInputFormat extends FileInputFormat<Text, Text> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    final CompressionCodec codec =
      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    if (null == codec) {
      return true;
    }
    return codec instanceof SplittableCompressionCodec;
  }

  public RecordReader<Text, Text> createRecordReader(InputSplit genericSplit,
      TaskAttemptContext context) throws IOException {
    
    context.setStatus(genericSplit.toString());
    return new KeyValueLineRecordReader(context.getConfiguration());
  }

}

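In KeyValueTextInputFormat#createRecordReader(InputSplit genericSplit, TaskAttemptContext context), the KeyValueLineRecordReader is constructed from the Configuration held by the TaskAttemptContext. The Context in Mapper implements MapContext, which extends TaskInputOutputContext, which in turn extends TaskAttemptContext. The context delegates record access to the RecordReader, so the context.getCurrentKey() and context.getCurrentValue() that Mapper#run(...) passes to map(...) come from the KeyValueLineRecordReader returned by KeyValueTextInputFormat#createRecordReader(...).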
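
To make the key/value rule from the class comment concrete, here is a small sketch of splitting one line at the first separator. It is an illustration only: KeyValueLineSketch and its split method are hypothetical names, and the sketch works on Java chars, whereas the real reader works on a separator byte.

```java
// Hypothetical helper illustrating the key/value rule; not Hadoop's implementation.
class KeyValueLineSketch {

    // Returns {key, value}, splitting at the FIRST occurrence of the separator.
    // Without a separator the whole line is the key and the value is empty.
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos == -1) {
            return new String[] {line, ""};
        }
        return new String[] {line.substring(0, pos), line.substring(pos + 1)};
    }

    public static void main(String[] args) {
        String[] kv = split("user42\tclicked\thome", '\t');
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```

Only the first separator matters, so any later tabs stay inside the value.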

2.Reducer

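/**
 * Reduces a set of intermediate values that share a key to a smaller set of values.
 *
 * A Reducer can access the Job's configuration via JobContext#getConfiguration().
 *
 * A Reducer has three phases:
 * 1. Shuffle. The Reducer copies the sorted output of each Mapper to the reduce side over HTTP.
 * 2. Sort. The framework merges the different inputs and sorts them by key.
 *    SecondarySort. To also sort the values, extend the key into a composite NewKey{key1,key2}
 *        and specify a grouping comparator. Keys are sorted using the entire NewKey, but grouped
 *        using the grouping comparator, which decides which <key,value> pairs are sent to the same
 *        reduce() call. The grouping comparator is set via Job#setGroupingComparatorClass(...).
 *        The sort order is controlled via Job#setSortComparatorClass(...).
 * 3. Reduce. The input to reduce() is <key,[values]>; with a secondary sort, the values are ordered.
 *
 */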

public class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {

  // The Context passed on to the Reducer implementations
  public abstract class Context 
    implements ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
  }

  /**
   * Called once at the start of the Reducer task.
   */
  protected void setup(Context context
                       ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Called once for each key.
   */
  @SuppressWarnings("unchecked")
  protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context
                        ) throws IOException, InterruptedException {
    for(VALUEIN value: values) {
      context.write((KEYOUT) key, (VALUEOUT) value);
    }
  }

  /**
   * Called once at the end of the task.
   */
  protected void cleanup(Context context
                         ) throws IOException, InterruptedException {
    // NOTHING
  }

  
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKey()) {
        reduce(context.getCurrentKey(), context.getValues(), context);
        // If a back up store is used, reset it
        Iterator<VALUEIN> iter = context.getValues().iterator();
        if(iter instanceof ReduceContext.ValueIterator) {
          ((ReduceContext.ValueIterator<VALUEIN>)iter).resetBackupStore();        
        }
      }
    } finally {
      cleanup(context);
    }
  }
}

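The reduce side comprises shuffle, merge, sort, group, reduce, and output. The mapper output is first copied to the reduce side, then merged and sorted (a merge sort). Values with the same key are then grouped, handed to reduce() for processing, and written out. For secondary sort, see the detailed explanation in the earlier post.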
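
The interplay of sorting and grouping is easy to lose in prose, so here is a rough in-memory sketch of it. Everything in it is an assumption made for illustration: SecondarySortSketch, NewKey, SORT and GROUP are hypothetical names, ordinary Comparators stand in for the RawComparators configured on the Job, and the value carried inside the composite key doubles as the reduce value.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// In-memory imitation of sort + group on the reduce side; not Hadoop code.
class SecondarySortSketch {

    // Hypothetical composite key: the natural key plus the field to order values by.
    static final class NewKey {
        final String key1;
        final int key2;
        NewKey(String key1, int key2) {
            this.key1 = key1;
            this.key2 = key2;
        }
    }

    // Sort comparator: orders by the entire composite key.
    static final Comparator<NewKey> SORT =
        Comparator.comparing((NewKey k) -> k.key1).thenComparingInt(k -> k.key2);

    // Grouping comparator: looks at the natural key only.
    static final Comparator<NewKey> GROUP = Comparator.comparing((NewKey k) -> k.key1);

    // Sorts the records, then emits one "key=[values]" string per group,
    // which corresponds to one reduce() call per group.
    static List<String> run(List<NewKey> mapOutput) {
        List<NewKey> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT);
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            NewKey first = sorted.get(i);
            List<Integer> values = new ArrayList<>();
            // Consecutive records that the grouping comparator treats as equal share a group.
            while (i < sorted.size() && GROUP.compare(first, sorted.get(i)) == 0) {
                values.add(sorted.get(i).key2);
                i++;
            }
            result.add(first.key1 + "=" + values);
        }
        return result;
    }

    public static void main(String[] args) {
        List<NewKey> input = new ArrayList<>();
        input.add(new NewKey("b", 2));
        input.add(new NewKey("a", 3));
        input.add(new NewKey("b", 1));
        input.add(new NewKey("a", 1));
        System.out.println(run(input));
    }
}
```

Because the records are sorted by the full composite key before grouping, each group's values arrive already ordered, which is the point of the secondary sort.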

That covers the execution logic of Mapper and Reducer. A later post will analyze the whole MapTask - Shuffle - ReduceTask process.
