MapReduce Features
-------------------------------------------------------------------------------------------------------------------------------------------------------
This chapter covers some of the more advanced features of MapReduce, including counters, sorting, and joining datasets.

1   Counters
-------------------------------------------------------------------------------------------------------------------------------------------------------
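Counters are a useful channel for gathering statistics about a job, whether for quality control or for application-level statistics.
They also help with problem diagnosis. If you are tempted to put a log message into your map or reduce task, it is often better to
check whether a counter can record that the condition occurred instead. On a large distributed job, counter values are much easier
to retrieve than log output. A counter also gives you the number of times the condition occurred directly, whereas extracting
the same figure from a set of log files takes more work.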

1. Built-in Counters
--------------------------------------------------------------------------------------------------------------------------------------------------------
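Hadoop maintains some built-in counters for every job, and these report various metrics. For example, there are counters for the number
of bytes and records processed, which let you confirm that the expected amount of input was consumed and the expected amount of output
was produced.

Counters are divided into groups. The built-in counter groups are listed below.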

                               Built-in counter groups
   +---------------------------+----------------------------------------------------------------+
   | Group                     | Name/Enum                                                      |
   +---------------------------+----------------------------------------------------------------+
   | MapReduce task counters   | org.apache.hadoop.mapreduce.TaskCounter                        |
   +---------------------------+----------------------------------------------------------------+
   | Filesystem counters       | org.apache.hadoop.mapreduce.FileSystemCounter                  |
   +---------------------------+----------------------------------------------------------------+
   | FileInputFormat counters  | org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter   |
   +---------------------------+----------------------------------------------------------------+
   | FileOutputFormat counters | org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter |
   +---------------------------+----------------------------------------------------------------+
   | Job counters              | org.apache.hadoop.mapreduce.JobCounter                         |
   +---------------------------+----------------------------------------------------------------+

Each group contains either task counters (which are updated as a task progresses) or job counters (which are updated as a job progresses).

   Task counters
   -------------------------------------------------------------------------------------------------------------------------------------------------------
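   Task counters gather information about tasks over the course of their execution, and the results are aggregated over all the tasks
   in a job. For example, the MAP_INPUT_RECORDS counter counts the input records read by each map task and aggregates over all map
   tasks in the job, so the final figure is the total number of input records for the whole job.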

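   Task counters are maintained by each task attempt and periodically sent to the application master so that they can be globally
   aggregated. Task counters are sent in full every time, rather than as the count since the last transmission, which guards against
   errors due to lost messages. Furthermore, during a job run, counters may go down if a task fails.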
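
   The effect of sending full values can be sketched in plain Java. The class below is a hypothetical stand-in for the aggregating
   side, not Hadoop's implementation: FullValueAggregator, its method names, and the attempt IDs are all invented for illustration,
   and dropping a failed attempt's counts is an assumption used to show how a total can go down.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of aggregating full-value counter reports (not Hadoop code). */
final class FullValueAggregator {
    // Latest reported value per task attempt; a newer report overwrites the older one.
    private final Map<String, Long> latest = new HashMap<>();

    /** Records the complete current value reported by a task attempt. */
    void report(String attemptId, long fullValue) {
        latest.put(attemptId, fullValue);
    }

    /** Assumption: a failed attempt's counts are dropped, so the total can decrease. */
    void discard(String attemptId) {
        latest.remove(attemptId);
    }

    /** Job-wide total: the sum of the latest value from every tracked attempt. */
    long total() {
        long sum = 0;
        for (long value : latest.values()) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        FullValueAggregator agg = new FullValueAggregator();
        agg.report("attempt_a", 100);
        // Suppose an intermediate report (say, 250) is lost in transit.
        // The next report still carries the full value, so nothing is missed.
        agg.report("attempt_a", 400);
        agg.report("attempt_b", 300);
        System.out.println(agg.total()); // 700
        agg.discard("attempt_b");        // attempt_b fails
        System.out.println(agg.total()); // 400
    }
}
```

   With delta reports, the lost message would have left the total permanently short; with full values, the next report repairs it.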

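   Counter values are definitive only once a job has successfully completed. However, some counters provide useful diagnostic
   information while a task is progressing, and it can be useful to monitor them with the web UI. For example, PHYSICAL_MEMORY_BYTES,
   VIRTUAL_MEMORY_BYTES, and COMMITTED_HEAP_BYTES indicate how memory usage varies over the course of a particular task attempt.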

   The built-in task counters include those in the MapReduce task counters group and those in the file-related counter groups.

                                       Built-in MapReduce task counters

   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Counter                               | Description                                                                               |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Map input records                       | The number of input records consumed by all the maps in the job. Incremented               |
   |(MAP_INPUT_RECORDS)                   | every time a record is read from a RecordReader and passed to the map’s map()               |
   |                                       | method by the framework.                                                                   |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Split raw bytes                        | The number of bytes of input-split objects read by maps. These objects represent           |
   |(SPLIT_RAW_BYTES)                       | the split metadata (that is, the offset and length within a file) rather than the split   |
   |                                       | data itself, so the total size should be small.                                           |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Map output records                   | The number of map output records produced by all the maps in the job.                       |
   |(MAP_OUTPUT_RECORDS)                   | Incremented every time the collect() method is called on a map’s OutputCollector.           |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Map output bytes                       | The number of bytes of uncompressed output produced by all the maps in the job.           |
   |(MAP_OUTPUT_BYTES)                       | Incremented every time the collect() method is called on a map’s OutputCollector.           |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Map output materialized bytes           | The number of bytes of map output actually written to disk. If map output                   |
   |(MAP_OUTPUT_MATERIALIZED_BYTES)       | compression is enabled, this is reflected in the counter value.                           |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Combine input records                   | The number of input records consumed by all the combiners (if any) in the job.           |
   |(COMBINE_INPUT_RECORDS)               | Incremented every time a value is read from the combiner’s iterator over values.           |
   |                                       | Note that this count is the number of values consumed by the combiner, not the           |
   |                                       | number of distinct key groups (which would not be a useful metric, since there is           |
   |                                       | not necessarily one group per key for a combiner).                                        |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Combine output records               | The number of output records produced by all the combiners (if any) in the job.           |
   |(COMBINE_OUTPUT_RECORDS)               | Incremented every time the collect() method is called on a combiner’s OutputCollector.   |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Reduce input groups                   | The number of distinct key groups consumed by all the reducers in the job.               |
   |(REDUCE_INPUT_GROUPS)                   | Incremented every time the reducer’s reduce() method is called by the framework.           |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Reduce input records                   | The number of input records consumed by all the reducers in the job.                       |
   |(REDUCE_INPUT_RECORDS)                   | Incremented every time a value is read from the reducer’s iterator over values. If       |
   |                                       | reducers consume all of their inputs, this count should be the same as the count           |
   |                                       | for map output records.                                                                   |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Reduce output records                   | The number of reduce output records produced by all the reduces in the job.               |
   |(REDUCE_OUTPUT_RECORDS)               | Incremented every time the collect() method is called on a reducer’s OutputCollector.       |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Reduce shuffle bytes                   | The number of bytes of map output copied by the shuffle to reducers.                       |
   |(REDUCE_SHUFFLE_BYTES)                   |                                                                                           |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Spilled records (SPILLED_RECORDS)       | The number of records spilled to disk in all map and reduce tasks in the job.               |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | CPU milliseconds                       | The cumulative CPU time for a task in milliseconds, as reported by                       |
   |(CPU_MILLISECONDS)                       | /proc/cpuinfo.                                                                           |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Physical memory bytes                   | The physical memory being used by a task in bytes, as reported by                           |
   |(PHYSICAL_MEMORY_BYTES)               | /proc/meminfo.                                                                           |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Virtual memory bytes                   | The virtual memory being used by a task in bytes, as reported by /proc/meminfo.           |
   |(VIRTUAL_MEMORY_BYTES)                   |                                                                                           |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Committed heap bytes                   | The total amount of memory available in the JVM in bytes, as reported by                   |
   |(COMMITTED_HEAP_BYTES)                   | Runtime.getRuntime().totalMemory().                                                       |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | GC time milliseconds                   | The elapsed time for garbage collection in tasks in milliseconds, as reported by           |
   |(GC_TIME_MILLIS)                       | GarbageCollectorMXBean.getCollectionTime().                                               |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Shuffled maps (SHUFFLED_MAPS)         | The number of map output files transferred to reducers by the shuffle.                    |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Failed shuffle (FAILED_SHUFFLE)       | The number of map output copy failures during the shuffle.                               |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Merged map outputs                   | The number of map outputs that have been merged on the reduce side of the shuffle.       |
   |(MERGED_MAP_OUTPUTS)                   |                                                                                           |
   +---------------------------------------+-------------------------------------------------------------------------------------------+

                                       Built-in filesystem task counters

   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Counter                               | Description                                                                               |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Filesystem bytes read                   | The number of bytes read by the filesystem by map and reduce tasks. There is a counter for|
   |(BYTES_READ)                           | each filesystem, and Filesystem may be Local, HDFS, S3, etc.                               |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Filesystem bytes written               | The number of bytes written by the filesystem by map and reduce tasks.                   |
   |(BYTES_WRITTEN)                       |                                                                                           |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Filesystem read ops                   | The number of read operations (e.g., open, file status) by the filesystem by map and       |
   |(READ_OPS)                               | reduce tasks.                                                                               |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Filesystem large read ops               | The number of large read operations (e.g., list directory for a large directory) by the   |
   |(LARGE_READ_OPS)                       | filesystem by map and reduce tasks.                                                       |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Filesystem write ops                   | The number of write operations (e.g., create, append) by the filesystem by map and reduce   |
   |(WRITE_OPS)                           | tasks.                                                                                   |
   +---------------------------------------+-------------------------------------------------------------------------------------------+

                                       FileInputFormat task counters

   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Counter                               | Description                                                                               |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Bytes read (BYTES_READ)               | The number of bytes read by map tasks via the FileInputFormat.                           |
   +---------------------------------------+-------------------------------------------------------------------------------------------+



                                       Built-in FileOutputFormat task counters

   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Counter                               | Description                                                                               |
   +---------------------------------------+-------------------------------------------------------------------------------------------+
   | Bytes written                           | The number of bytes written by map tasks (for map-only jobs) or reduce tasks via the       |
   |(BYTES_WRITTEN)                       | FileOutputFormat.                                                                           |
   +---------------------------------------+-------------------------------------------------------------------------------------------+




   Job counters
   -------------------------------------------------------------------------------------------------------------------------------------------------------
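   Job counters are maintained by the application master, so they do not need to be sent across the network, unlike all other
   counters, including user-defined ones. They measure job-level statistics, not values that change while a task is running.
   For example, TOTAL_LAUNCHED_MAPS counts the number of map tasks launched over the course of a job (including tasks that failed).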



                                       Built-in job counters

   +---------------------------------------+---------------------------------------------------------------------------------------------------+
   | Counter                               | Description                                                                                       |
   +---------------------------------------+---------------------------------------------------------------------------------------------------+
   | Launched map tasks                    | The number of map tasks that were launched. Includes tasks that were started speculatively.       |
   |(TOTAL_LAUNCHED_MAPS)                   |                                                                                                    |
   +---------------------------------------+---------------------------------------------------------------------------------------------------+
   | Launched reduce tasks                   | The number of reduce tasks that were launched. Includes tasks that were started speculatively.   |
   |(TOTAL_LAUNCHED_REDUCES)               |                                                                                                   |
   +---------------------------------------+---------------------------------------------------------------------------------------------------+
   | Launched uber tasks                   | The number of uber tasks (see Anatomy of a MapReduce Job Run) that were launched.                   |
   |(TOTAL_LAUNCHED_UBERTASKS)               |                                                                                                   |
   +---------------------------------------+---------------------------------------------------------------------------------------------------+
   | Maps in uber tasks                   | The number of maps in uber tasks.                                                                   |
   |(NUM_UBER_SUBMAPS)                       |                                                                                                   |
   +---------------------------------------+---------------------------------------------------------------------------------------------------+
   | Reduces in uber tasks                   | The number of reduces in uber tasks.                                                               |
   | (NUM_UBER_SUBREDUCES)                   |                                                                                                   |
   +---------------------------------------+---------------------------------------------------------------------------------------------------+
   | Failed map tasks                       | The number of map tasks that failed. See Task Failure for potential causes.                       |
   |(NUM_FAILED_MAPS)                       |                                                                                                   |
   +---------------------------------------+---------------------------------------------------------------------------------------------------+
   | Failed reduce tasks                   | The number of reduce tasks that failed.                                                           |
   |(NUM_FAILED_REDUCES)                   |                                                                                                    |
   +---------------------------------------+---------------------------------------------------------------------------------------------------+
   | Failed uber tasks                       | The number of uber tasks that failed.                                                               |
   |(NUM_FAILED_UBERTASKS)                   |                                                                                                   |
   +---------------------------------------+---------------------------------------------------------------------------------------------------+
   | Killed map tasks                       | The number of map tasks that were killed. See Task Failure for potential causes.                   |
   |(NUM_KILLED_MAPS)                       |                                                                                                   |
   +---------------------------------------+---------------------------------------------------------------------------------------------------+
   | Killed reduce tasks                   | The number of reduce tasks that were killed.                                                       |
   |(NUM_KILLED_REDUCES)                   |                                                                                                   |
   +---------------------------------------+---------------------------------------------------------------------------------------------------+
   | Data-local map tasks                   | The number of map tasks that ran on the same node as their input data.                           |
   |(DATA_LOCAL_MAPS)                       |                                                                                                   |
   +---------------------------------------+---------------------------------------------------------------------------------------------------+
   | Rack-local map tasks                   | The number of map tasks that ran on a node in the same rack as their input data, but               |
   |(RACK_LOCAL_MAPS)                       | were not data-local.                                                                               |
   +---------------------------------------+---------------------------------------------------------------------------------------------------+
   | Other local map tasks                   | The number of map tasks that ran on a node in a different rack to their input data. Interrack       |
   |(OTHER_LOCAL_MAPS)                       | bandwidth is scarce, and Hadoop tries to place map tasks close to their input data,               |
   |                                       | so this count should be low.                                                                       |
   +---------------------------------------+---------------------------------------------------------------------------------------------------+
   | Total time in map tasks               | The total time taken running map tasks, in milliseconds. Includes tasks that were                   |
   |(MILLIS_MAPS)                           | started speculatively. See also corresponding counters for measuring core and memory               |
   |                                       | usage (VCORES_MILLIS_MAPS and MB_MILLIS_MAPS).                                                   |
   +---------------------------------------+---------------------------------------------------------------------------------------------------+
   | Total time in reduce tasks           | The total time taken running reduce tasks, in milliseconds. Includes tasks that were               |
   |(MILLIS_REDUCES)                       | started speculatively. See also corresponding counters for measuring core and memory               |
   |                                       | usage (VCORES_MILLIS_REDUCES and MB_MILLIS_REDUCES).                                               |
   +---------------------------------------+---------------------------------------------------------------------------------------------------+


2. User-Defined Java Counters
--------------------------------------------------------------------------------------------------------------------------------------------------------
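MapReduce allows user code to define a set of counters, which are then incremented as desired in the mapper or reducer. Counters are
defined by a Java enum, which serves to group related counters. A job may define an arbitrary number of enums, each with an arbitrary
number of fields. The name of the enum is the group name, and the enum's fields are the counter names.
Counters are global: the MapReduce framework aggregates them across all maps and reduces to produce a grand total at the end of the job.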

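   // Application to run the maximum temperature job, counting missing and
   // malformed temperature fields as well as quality codes.
   // NcdcRecordParser, JobBuilder, and MaxTemperatureReducer are helper classes
   // that are not shown in this section.
   import java.io.IOException;
   import org.apache.hadoop.conf.Configured;
   import org.apache.hadoop.io.IntWritable;
   import org.apache.hadoop.io.LongWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Job;
   import org.apache.hadoop.mapreduce.Mapper;
   import org.apache.hadoop.util.Tool;
   import org.apache.hadoop.util.ToolRunner;
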
   public class MaxTemperatureWithCounters extends Configured implements Tool {
       enum Temperature {
           MISSING,
           MALFORMED
       }

       static class MaxTemperatureMapperWithCounters
               extends Mapper<LongWritable, Text, Text, IntWritable> {

           private NcdcRecordParser parser = new NcdcRecordParser();

           @Override
           protected void map(LongWritable key, Text value, Context context)
                   throws IOException, InterruptedException {

               parser.parse(value);
               if (parser.isValidTemperature()) {
                   int airTemperature = parser.getAirTemperature();
                   context.write(new Text(parser.getYear()),
                           new IntWritable(airTemperature));
               } else if (parser.isMalformedTemperature()) {
                   System.err.println("Ignoring possibly corrupt input: " + value);
                   context.getCounter(Temperature.MALFORMED).increment(1);
               } else if (parser.isMissingTemperature()) {
                   context.getCounter(Temperature.MISSING).increment(1);
               }
               // Dynamic counter: one counter per quality code, named by string rather than by enum
               context.getCounter("TemperatureQuality", parser.getQuality()).increment(1);
           }
       }
       @Override
       public int run(String[] args) throws Exception {
           Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
           if (job == null) {
               return -1;
           }
           job.setOutputKeyClass(Text.class);
           job.setOutputValueClass(IntWritable.class);
           job.setMapperClass(MaxTemperatureMapperWithCounters.class);
           job.setCombinerClass(MaxTemperatureReducer.class);
           job.setReducerClass(MaxTemperatureReducer.class);
           return job.waitForCompletion(true) ? 0 : 1;
       }

       public static void main(String[] args) throws Exception {
           int exitCode = ToolRunner.run(new MaxTemperatureWithCounters(), args);
           System.exit(exitCode);
       }
   }

Run the job:

   % hadoop jar hadoop-examples.jar MaxTemperatureWithCounters \
   input/ncdc/all output-counters

When the job has successfully completed, it prints out the counters at the end:

   Air Temperature Records
     Malformed=3
     Missing=66136856
   TemperatureQuality
     0=1
     1=973422173
     2=1246032
     4=10764500
     5=158291879
     6=40066
     9=66136858



   Dynamic counters
   -----------------------------------------------------------------------------------------------------------------------------------
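   The code above uses a dynamic counter: one that is not defined by a Java enum. Because a Java enum's fields are defined at compile
   time, you cannot create new counters on the fly using enums. The getCounter() method on the Context object takes a group name and
   a counter name as strings and returns a dynamic counter: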

       public Counter getCounter(String groupName, String counterName)

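   The two ways of creating and accessing counters, using enums and using strings, are actually equivalent, because Hadoop turns
   enums into strings to send counters over RPC. Enums are slightly easier to work with, provide type safety, and are suitable for
   most jobs. For the odd occasion when you need to create counters dynamically, you can use the String interface.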
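
   As a rough illustration of why the two forms are interchangeable, the sketch below reduces an enum constant to a pair of strings.
   It is plain Java, not Hadoop code: CounterNames, groupOf(), and nameOf() are hypothetical helpers, and deriving the group from
   the enum's class name is an assumption that follows the naming rule described in this section.

```java
/** Hypothetical helpers that reduce an enum counter to (group name, counter name) strings. */
final class CounterNames {
    enum Temperature { MISSING, MALFORMED }

    /** Group name: the Java class name of the enum that declares the constant. */
    static String groupOf(Enum<?> counter) {
        return counter.getDeclaringClass().getName();
    }

    /** Counter name: the identifier of the enum constant itself. */
    static String nameOf(Enum<?> counter) {
        return counter.name();
    }

    public static void main(String[] args) {
        // For a nested enum the class name uses '$', e.g. CounterNames$Temperature
        // (preceded by the package name when the class lives in a package).
        System.out.println(groupOf(Temperature.MISSING));
        System.out.println(nameOf(Temperature.MISSING)); // MISSING
    }
}
```

   Once a counter is identified by two strings, an enum-based lookup and a string-based lookup can refer to the same counter.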



   Readable counter names
   -----------------------------------------------------------------------------------------------------------------------------------
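   By default, a counter's name is the enum's fully qualified Java class name. These names are not very readable in the web UI or
   in the console, so Hadoop provides a way to change the display names using resource bundles. This has been done for the example
   above, so the output shows Air Temperature Records instead of Temperature$MISSING. For dynamic counters, the group and counter
   names are used as the display names, so this is not normally an issue.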

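   Providing readable names is straightforward. Create a properties file named after the enum, using an underscore as the separator
   for nested classes. The properties file should be in the same directory as the top-level class containing the enum. For the
   Temperature enum in this example, the file is named: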

       MaxTemperatureWithCounters_Temperature.properties

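   The properties file should contain a single property named CounterGroupName, whose value is the display name for the whole group.
   Each field in the enum should then have a corresponding property, whose name is the field name suffixed with .name and whose
   value is the display name for that counter.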

   The contents of MaxTemperatureWithCounters_Temperature.properties are as follows:

       CounterGroupName=Air Temperature Records
       MISSING.name=Missing
       MALFORMED.name=Malformed

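   Hadoop uses the standard Java localization mechanisms to load the correct properties file for the locale you are running in.
   For example, you can create a Chinese version of the properties file named: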

       MaxTemperatureWithCounters_Temperature_zh_CN.properties

   This file is used when running in the zh_CN locale.
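
   The lookup convention can be tried out with the JDK's own resource-bundle classes. The sketch below is only an illustration of
   the property keys described above, not Hadoop's loading code: CounterDisplayNames, lookup(), and fromText() are hypothetical,
   and the bundle is built from an in-memory string so that no file is needed on the classpath.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.MissingResourceException;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

/** Hypothetical helper that resolves counter display names from a resource bundle. */
final class CounterDisplayNames {
    /** Returns the bundle's value for the key, or the fallback when the key is absent. */
    static String lookup(ResourceBundle bundle, String key, String fallback) {
        try {
            return bundle.getString(key);
        } catch (MissingResourceException e) {
            return fallback;
        }
    }

    /** Builds a bundle from properties text held in memory. */
    static ResourceBundle fromText(String properties) {
        try {
            return new PropertyResourceBundle(new StringReader(properties));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ResourceBundle bundle = fromText(
                "CounterGroupName=Air Temperature Records\n"
                + "MISSING.name=Missing\n");
        System.out.println(lookup(bundle, "CounterGroupName", "Temperature")); // Air Temperature Records
        System.out.println(lookup(bundle, "MISSING.name", "MISSING"));         // Missing
        // No MALFORMED.name entry in this bundle, so the raw field name is used.
        System.out.println(lookup(bundle, "MALFORMED.name", "MALFORMED"));     // MALFORMED
    }
}
```

   For a real locale-specific file, ResourceBundle.getBundle() with a Locale argument applies the same _zh_CN suffix rule.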



   Retrieving counters
   -----------------------------------------------------------------------------------------------------------------------------------
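   In addition to using the web UI and the command line (using mapred job -counter), you can retrieve counter values using the
   Java API. You normally do this once the job has completed and the counters are stable, but the Java API also supports retrieving
   counter values while the job is still running.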

   // Application to calculate the proportion of records with missing temperature fields
   import org.apache.hadoop.conf.Configured;
   import org.apache.hadoop.mapreduce.*;
   import org.apache.hadoop.util.*;
   public class MissingTemperatureFields extends Configured implements Tool {

       @Override
       public int run(String[] args) throws Exception {
           if (args.length != 1) {
               JobBuilder.printUsage(this, "<job ID>");
               return -1;
           }

           String jobID = args[0];
           Cluster cluster = new Cluster(getConf());
           Job job = cluster.getJob(JobID.forName(jobID));
           if (job == null) {
               System.err.printf("No job with ID %s found.\n", jobID);
               return -1;
           }
           if (!job.isComplete()) {
               System.err.printf("Job %s is not complete.\n", jobID);
               return -1;
           }

           Counters counters = job.getCounters();
           long missing = counters.findCounter(
                   MaxTemperatureWithCounters.Temperature.MISSING).getValue();
           long total = counters.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
           System.out.printf("Records with missing temperature fields: %.2f%%\n",
                   100.0 * missing / total);
           return 0;
       }
       public static void main(String[] args) throws Exception {
           int exitCode = ToolRunner.run(new MissingTemperatureFields(), args);
           System.exit(exitCode);
       }
   }

   Run the program with a job ID:
       % hadoop jar hadoop-examples.jar MissingTemperatureFields job_1410450250506_0007
   Output:
       Records with missing temperature fields: 5.47%



3. User-Defined Streaming Counters
--------------------------------------------------------------------------------------------------------------------------------------------------------
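A Streaming MapReduce program can increment counters by sending a specially formatted line to the standard error stream, which serves
as a control channel in this case. The line must have the following format: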

   reporter:counter:group,counter,amount

This Python snippet increments the Missing counter in the Temperature group by 1:

   sys.stderr.write("reporter:counter:Temperature,Missing,1\n")

In a similar way, a status message may be sent with a line formatted like this:

   reporter:status:message
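
The two line formats above are easy to wrap in small helpers. The following is a minimal sketch for a Streaming task written in
Java; StreamingReporter, counterLine(), and statusLine() are hypothetical names, and the sketch assumes that group and counter
names contain no commas, since the comma is the field separator.

```java
/** Hypothetical helpers that format Streaming control lines for standard error. */
final class StreamingReporter {
    /** Formats a counter update as reporter:counter:group,counter,amount. */
    static String counterLine(String group, String counter, long amount) {
        return "reporter:counter:" + group + "," + counter + "," + amount;
    }

    /** Formats a status update as reporter:status:message. */
    static String statusLine(String message) {
        return "reporter:status:" + message;
    }

    public static void main(String[] args) {
        // Control lines go to standard error, one per line; standard output carries the data.
        System.err.println(counterLine("Temperature", "Missing", 1));
        System.err.println(statusLine("processed 1000 records"));
    }
}
```

Keeping the formatting in one place avoids typos in the reporter: prefix, which would otherwise turn a counter update into an
ordinary log line.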


*
*
*

2   Sorting
-------------------------------------------------------------------------------------------------------------------------------------------------------
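The ability to sort data is at the heart of MapReduce. Even if your application is not concerned with sorting as such, it may be able
to use the sorting stage that MapReduce provides to organize its data.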

1. Preparation
--------------------------------------------------------------------------------------------------------------------------------------------------------
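We are going to sort the weather dataset by temperature. Storing temperatures as Text objects does not work for sorting purposes,
because signed integers do not sort lexicographically. Instead, we store the data in sequence files whose IntWritable keys represent
the temperatures (and sort correctly) and whose Text values are the lines of data.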
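
The difference between the two orderings is easy to reproduce in plain Java. In the sketch below, String stands in for Text and
int for IntWritable (an assumption made to keep the example free of Hadoop dependencies); TemperatureOrdering and its methods are
hypothetical names, and the sample values are invented.

```java
import java.util.Arrays;

/** Contrasts lexicographic ordering of temperature strings with numeric ordering. */
final class TemperatureOrdering {
    /** Sorts the temperatures after converting them to strings (lexicographic order). */
    static String[] sortAsText(int[] temps) {
        String[] text = new String[temps.length];
        for (int i = 0; i < temps.length; i++) {
            text[i] = Integer.toString(temps[i]);
        }
        Arrays.sort(text);
        return text;
    }

    /** Sorts a copy of the temperatures numerically. */
    static int[] sortAsInts(int[] temps) {
        int[] copy = temps.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        int[] temps = {-10, -1, 9, 200, 35};
        System.out.println(Arrays.toString(sortAsText(temps))); // [-1, -10, 200, 35, 9]
        System.out.println(Arrays.toString(sortAsInts(temps))); // [-10, -1, 9, 35, 200]
    }
}
```

The text ordering places -1 before -10 and 200 before 9, which is why the temperature is stored as an integer key.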

The MapReduce job below is a map-only job that also filters the input to remove records that do not have a valid temperature reading.
Each map creates a single block-compressed sequence file as output.

   // A MapReduce program for transforming the weather data into SequenceFile format

   public class SortDataPreprocessor extends Configured implements Tool {
       static class CleanerMapper
               extends Mapper<LongWritable, Text, IntWritable, Text> {

           private NcdcRecordParser parser = new NcdcRecordParser();

           @Override
           protected void map(LongWritable key, Text value, Context context)
                   throws IOException, InterruptedException {
               parser.parse(value);
               if (parser.isValidTemperature()) {
                   context.write(new IntWritable(parser.getAirTemperature()), value);
               }
           }
       }

       @Override
       public int run(String[] args) throws Exception {
           Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
           if (job == null) {
               return -1;
           }
           job.setMapperClass(CleanerMapper.class);
           job.setOutputKeyClass(IntWritable.class);
           job.setOutputValueClass(Text.class);
           job.setNumReduceTasks(0);
           job.setOutputFormatClass(SequenceFileOutputFormat.class);
           SequenceFileOutputFormat.setCompressOutput(job, true);
           SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
           SequenceFileOutputFormat.setOutputCompressionType(job,
           CompressionType.BLOCK);
           return job.waitForCompletion(true) ? 0 : 1;
       }

       public static void main(String[] args) throws Exception {
           int exitCode = ToolRunner.run(new SortDataPreprocessor(), args);
           System.exit(exitCode);
       }
   }

Run:

   % hadoop jar hadoop-examples.jar SortDataPreprocessor input/ncdc/all \
   input/ncdc/all-seq

2. Partial Sort
--------------------------------------------------------------------------------------------------------------------------------------------------------
By default, MapReduce sorts input records by their keys. The following program is a variation that sorts sequence files with IntWritable keys.

   // A MapReduce program for sorting a SequenceFile with IntWritable keys using the default HashPartitioner

   public class SortByTemperatureUsingHashPartitioner extends Configured
           implements Tool {

       @Override
       public int run(String[] args) throws Exception {
           Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
           if (job == null) {
               return -1;
           }
           job.setInputFormatClass(SequenceFileInputFormat.class);
           job.setOutputKeyClass(IntWritable.class);
           job.setOutputFormatClass(SequenceFileOutputFormat.class);
           SequenceFileOutputFormat.setCompressOutput(job, true);
           SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
           SequenceFileOutputFormat.setOutputCompressionType(job,
           CompressionType.BLOCK);
           return job.waitForCompletion(true) ? 0 : 1;
       }

       public static void main(String[] args) throws Exception {
           int exitCode = ToolRunner.run(new SortByTemperatureUsingHashPartitioner(),
           args);
           System.exit(exitCode);
       }
   }

   Controlling Sort Order
   ----------------------------------------------------------------------------------------------------------------------------------------------------
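   The sort order for keys is controlled by a RawComparator, which is found as follows:

       1. If the property mapreduce.job.output.key.comparator.class is set, either explicitly or by calling setSortComparatorClass() on Job, then
          an instance of that class is used.
       2. Otherwise, keys must be a subclass of WritableComparable, and the registered comparator for the key class is used.
       3. If there is no registered comparator, then a RawComparator is used. It deserializes the byte streams being compared into objects and
          delegates to the WritableComparable's compareTo() method.

       These rules reinforce the importance of registering optimized versions of RawComparators for your own custom Writable classes, and they
       also show that it is straightforward to override the sort order by setting your own comparator.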


Suppose we run this program using 30 reducers:

   % hadoop jar hadoop-examples.jar SortByTemperatureUsingHashPartitioner \
   -D mapreduce.job.reduces=30 input/ncdc/all-seq output-hashsort

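This command produces 30 output files, each of which is sorted. However, there is no easy way to combine the files to produce a globally sorted file.
For many applications this does not matter. For example, a partially sorted set of files is fine when you want to do lookups by key. The
SortByTemperatureToMapFile and LookupRecordsByTemperature classes illustrate this idea. By using a map file instead of a sequence file, it is
possible to first find the relevant partition that a key belongs in (using the partitioner), and then to do an efficient lookup of the record
within the map file partition.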
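
To see why hash-partitioned output cannot simply be concatenated, and how a lookup can still find the right file, here is a simplified model of the
partition choice. It assumes the key's hash code is the integer value itself (as it is for IntWritable) and reuses the hash-then-modulo formula
of the default hash partitioner. HashPartitionSketch is a hypothetical name and no Hadoop classes are involved.

```java
/** Simplified model of hash partitioning for integer keys. */
final class HashPartitionSketch {

    /** Masks the sign bit of the hash code, then takes it modulo the partition count. */
    static int partition(int key, int numPartitions) {
        return (Integer.hashCode(key) & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 30;
        // Keys 25 and 55 share a partition while 26 lands in another one,
        // so the key ranges of the output files overlap.
        for (int key : new int[] {25, 26, 55, -11}) {
            System.out.println(key + " -> partition " + partition(key, numPartitions));
        }
    }
}
```

A lookup for a given key only needs to recompute the same function to know which of the 30 files to open.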


   public class SortByTemperatureToMapFile extends Configured implements Tool {

      @Override
      public int run(String[] args) throws Exception {
       Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
       if (job == null) {
          return -1;
       }

       job.setInputFormatClass(SequenceFileInputFormat.class);
       job.setOutputKeyClass(IntWritable.class);
       job.setOutputFormatClass(MapFileOutputFormat.class);
       SequenceFileOutputFormat.setCompressOutput(job, true);
       SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
       SequenceFileOutputFormat.setOutputCompressionType(job,
           CompressionType.BLOCK);

       return job.waitForCompletion(true) ? 0 : 1;
      }

      public static void main(String[] args) throws Exception {
       int exitCode = ToolRunner.run(new SortByTemperatureToMapFile(), args);
       System.exit(exitCode);
      }
   }



   public class LookupRecordsByTemperature extends Configured implements Tool {

      @Override
      public int run(String[] args) throws Exception {
       if (args.length != 2) {
          JobBuilder.printUsage(this, "<path> <key>");
          return -1;
       }
       Path path = new Path(args[0]);
       IntWritable key = new IntWritable(Integer.parseInt(args[1]));

       Reader[] readers = MapFileOutputFormat.getReaders(path, getConf());
       Partitioner<IntWritable, Text> partitioner =
          new HashPartitioner<IntWritable, Text>();
       Text val = new Text();

       Reader reader = readers[partitioner.getPartition(key, val, readers.length)];
       Writable entry = reader.get(key, val);
       if (entry == null) {
          System.err.println("Key not found: " + key);
          return -1;
       }
       NcdcRecordParser parser = new NcdcRecordParser();
       IntWritable nextKey = new IntWritable();
       do {
          parser.parse(val.toString());
          System.out.printf("%s\t%s\n", parser.getStationId(), parser.getYear());
       } while(reader.next(nextKey, val) && key.equals(nextKey));
       return 0;
      }

      public static void main(String[] args) throws Exception {
       int exitCode = ToolRunner.run(new LookupRecordsByTemperature(), args);
       System.exit(exitCode);
      }
   }



3. Total Sort
--------------------------------------------------------------------------------------------------------------------------------------------------------
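How can you produce a globally sorted file using Hadoop? The naive answer is to use a single partition. But this is very inefficient for large
files, because one machine has to process all of the output, so you throw away the benefits of the parallel architecture that MapReduce provides.
Instead, it is possible to produce a set of sorted files that, if concatenated, would form a globally sorted file. The secret to doing this is to
use a partitioner that respects the total order of the output. For example, if we had four partitions, we could put keys for temperatures less
than -10°C in the first partition, those between -10°C and 0°C in the second, those between 0°C and 10°C in the third, and those over 10°C in the
fourth.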
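
The four-partition example can be sketched as a binary search over the three boundary values. This is a simplified model, not Hadoop's
TotalOrderPartitioner; the class name RangePartitionSketch is hypothetical, temperatures are taken to be whole degrees Celsius, and a key equal to a
boundary is assigned to the partition above it (the text leaves that choice open).

```java
import java.util.Arrays;

/** Assigns a temperature to one of four ordered partitions using fixed boundaries. */
final class RangePartitionSketch {

    // Boundaries in degrees Celsius: (< -10), [-10, 0), [0, 10), (>= 10).
    static final int[] SPLIT_POINTS = {-10, 0, 10};

    static int partition(int temperature) {
        int pos = Arrays.binarySearch(SPLIT_POINTS, temperature);
        // Found at index i: the key equals a boundary and goes to partition i + 1.
        // Not found: binarySearch returns -(insertionPoint) - 1, and the
        // insertion point is exactly the partition number.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        for (int t : new int[] {-25, -10, -5, 0, 5, 10, 31}) {
            System.out.println(t + " -> partition " + partition(t));
        }
    }
}
```

Because every key in partition i is smaller than every key in partition i + 1, sorting within each partition is enough to make the concatenated
output globally sorted.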

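Although this approach works, you have to choose your partition sizes carefully to ensure that they are fairly even, so that job times are not
dominated by a single reducer.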

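To construct more even partitions, we need a better understanding of the temperature distribution for the whole dataset. It is fairly easy to
write a MapReduce job to count the number of records that fall into a collection of temperature buckets. Although we could use this information to
construct a very even set of partitions, having to run a job over the entire dataset just to construct them is not ideal. It is possible to get a
fairly even set of partitions by sampling the key space instead.
The idea behind sampling is that you look at a small subset of the keys to approximate the key distribution, which is then used to construct
partitions. Luckily, we do not have to write the code to do this ourselves, as Hadoop comes with a selection of samplers.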
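
A minimal sketch of the idea, assuming the sample is small enough to sort in memory: sort the sampled keys and pick evenly spaced elements as
partition boundaries. This is an illustration only and not the algorithm Hadoop's samplers are guaranteed to use; SampleSplitPoints and its method
are hypothetical names.

```java
import java.util.Arrays;

/** Derives partition boundaries from a sample of keys. */
final class SampleSplitPoints {

    /** Returns numPartitions - 1 boundaries taken at evenly spaced positions of the sorted sample. */
    static int[] splitPoints(int[] sample, int numPartitions) {
        if (numPartitions < 1 || sample.length < numPartitions) {
            throw new IllegalArgumentException("need at least one sampled key per partition");
        }
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] splits = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            // Position of the i-th quantile; long arithmetic avoids overflow.
            int index = (int) ((long) i * sorted.length / numPartitions);
            splits[i - 1] = sorted[index];
        }
        return splits;
    }

    public static void main(String[] args) {
        int[] sample = {5, 1, 9, 3, 7, 2, 8, 4, 6, 0};
        System.out.println(Arrays.toString(splitPoints(sample, 5)));
    }
}
```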

The InputSampler class defines a nested Sampler interface whose implementations return a sample of keys given an InputFormat and Job:

   public interface Sampler<K, V> {
       K[] getSample(InputFormat<K, V> inf, Job job)
               throws IOException, InterruptedException;
   }


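This interface usually is not called directly by clients. Instead, the writePartitionFile() static method on InputSampler is used, which creates a
sequence file to store the keys that define the partitions: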

   public static <K, V> void writePartitionFile(Job job, Sampler<K, V> sampler)
           throws IOException, ClassNotFoundException, InterruptedException

The sequence file is used by TotalOrderPartitioner to create partitions for the sort job.

// A MapReduce program for sorting a SequenceFile with IntWritable keys using the TotalOrderPartitioner to globally sort the data
public class SortByTemperatureUsingTotalOrderPartitioner extends Configured
implements Tool {

@Override
public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }

    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    SequenceFileOutputFormat.setCompressOutput(job, true);
    SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job,
        CompressionType.BLOCK);

    job.setPartitionerClass(TotalOrderPartitioner.class);

    InputSampler.Sampler<IntWritable, Text> sampler =
      new InputSampler.RandomSampler<IntWritable, Text>(0.1, 10000, 10);

    InputSampler.writePartitionFile(job, sampler);

    // Add to DistributedCache
    Configuration conf = job.getConfiguration();
    String partitionFile = TotalOrderPartitioner.getPartitionFile(conf);
    URI partitionUri = new URI(partitionFile);
    job.addCacheFile(partitionUri);

    return job.waitForCompletion(true) ? 0 : 1;
}

public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(
        new SortByTemperatureUsingTotalOrderPartitioner(), args);
    System.exit(exitCode);
}
}


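The program uses a RandomSampler, which chooses keys with a uniform probability (here, 0.1). There are also parameters for the maximum number of
samples to take and the maximum number of splits to sample (here, 10,000 and 10, respectively; these settings are the defaults when InputSampler
is run as an application), and the sampler stops when the first of these limits is met. Samplers run on the client, making it important to limit
the number of splits that are downloaded so the sampler runs quickly. In practice, the time taken to run the sampler is a small fraction of the
overall job time.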

To share the partition file with the tasks running on the cluster, the program adds the partition file written by InputSampler to the distributed
cache.

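The properties of the input data determine the best sampler to use. For example, SplitSampler, which samples only the first n records in a split,
is not so good for sorted data, because it does not select keys from throughout the split. On the other hand, IntervalSampler chooses keys at
regular intervals through the split and makes a better choice for sorted data. RandomSampler is a good general-purpose sampler. If none of these
suits your application (and remember that the point of sampling is to produce partitions that are approximately equal in size), you can write your
own implementation of the Sampler interface.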

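One of the nice properties of InputSampler and TotalOrderPartitioner is that you are free to choose the number of partitions, that is, the number
of reducers. However, TotalOrderPartitioner will work only if the partition boundaries are distinct. One problem with choosing a high number of
partitions is that you may get duplicate boundaries if the key space is small.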

Run:
       % hadoop jar hadoop-examples.jar SortByTemperatureUsingTotalOrderPartitioner \
       -D mapreduce.job.reduces=30 input/ncdc/all-seq output-totalsort

The program produces 30 output partitions, each of which is internally sorted; in addition, all the keys in partition i are less than the keys in
partition i + 1.


4. Secondary Sort
--------------------------------------------------------------------------------------------------------------------------------------------------------
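The MapReduce framework sorts the records by key before they reach the reducers. For any particular key, however, the values are not sorted. The
order in which the values appear is not even stable from one run to the next, because they come from different map tasks, which may finish at
different times from run to run. Generally, most MapReduce programs are written so as not to depend on the order in which the values appear to the
reduce function. However, it is possible to impose an order on the values by sorting and grouping the keys in a particular way.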

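To illustrate the idea, consider the MapReduce program for calculating the maximum temperature for each year.
To this end, we change our keys to be composite: a combination of year and temperature. We want the sort order for keys to be by year (ascending)
and then by temperature (descending):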
   1900 35°C
   1900 34°C
   1900 34°C…
   1901 36°C
   1901 35°C

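If all we did was change the key, this would not help, because records for the same year would then have different keys and therefore would not
(in general) go to the same reducer. For example, (1900, 35°C) and (1900, 34°C) could go to different reducers.
By setting a partitioner to partition by the year part of the key, we can guarantee that records for the same year go to the same reducer. This
still is not enough to achieve our goal, however. A partitioner ensures only that one reducer receives all the records for a year; it does not
change the fact that the reducer groups by key within the partition. The final piece of the puzzle is the setting that controls the grouping. If
we group values in the reducer by the year part of the key, we will see all the records for the same year in one reduce group. And because they
are sorted by temperature in descending order, the first is the maximum temperature.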

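To summarize, the recipe for getting the effect of sorting by value is:
   1. Make the key a composite of the natural key and the natural value.
   2. The sort comparator should order by the composite key (that is, by both the natural key and the natural value).
   3. The partitioner and grouping comparator for the composite key should consider only the natural key for partitioning and grouping.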
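
The recipe can be tried out in plain Java before wiring it into a job. The sketch below is a single-process model under simplifying assumptions:
composite keys are {year, temperature} arrays, there is only one "reducer" (so partitioning is left out), and a map keyed by year plays the role of
the grouping comparator. SecondarySortSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Single-process model of secondary sort: sort by composite key, group by natural key. */
final class SecondarySortSketch {

    /** Sort comparator for {year, temperature} keys: year ascending, temperature descending. */
    static final Comparator<int[]> SORT_ORDER = (a, b) -> {
        int cmp = Integer.compare(a[0], b[0]);
        return cmp != 0 ? cmp : Integer.compare(b[1], a[1]);
    };

    /** Returns year -> maximum temperature, in ascending year order. */
    static Map<Integer, Integer> maxTemperatureByYear(int[][] compositeKeys) {
        int[][] sorted = compositeKeys.clone();
        Arrays.sort(sorted, SORT_ORDER);
        Map<Integer, Integer> result = new LinkedHashMap<>();
        for (int[] key : sorted) {
            // Grouping by year only: the first key of each group carries the maximum.
            result.putIfAbsent(key[0], key[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] keys = {{1901, 35}, {1900, 34}, {1900, 35}, {1901, 36}, {1900, 34}};
        System.out.println(maxTemperatureByYear(keys));
    }
}
```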

Java code
--------------------------------------------------------------------------------------------------------------------------------------------------------
public class MaxTemperatureUsingSecondarySort
extends Configured implements Tool {

static class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, IntPair, NullWritable> {

    private NcdcRecordParser parser = new NcdcRecordParser();

    @Override
    protected void map(LongWritable key, Text value,
        Context context) throws IOException, InterruptedException {

      parser.parse(value);
      if (parser.isValidTemperature()) {
        context.write(new IntPair(parser.getYearInt(),
            parser.getAirTemperature()), NullWritable.get());
      }
    }
}

static class MaxTemperatureReducer
    extends Reducer<IntPair, NullWritable, IntPair, NullWritable> {

    @Override
    protected void reduce(IntPair key, Iterable<NullWritable> values,
        Context context) throws IOException, InterruptedException {

      context.write(key, NullWritable.get());
    }
}

public static class FirstPartitioner
    extends Partitioner<IntPair, NullWritable> {

    @Override
    public int getPartition(IntPair key, NullWritable value, int numPartitions) {
      // multiply by 127 to perform some mixing; mask the sign bit so the result is never negative
      return ((key.getFirst() * 127) & Integer.MAX_VALUE) % numPartitions;
    }
}

public static class KeyComparator extends WritableComparator {
    protected KeyComparator() {
      super(IntPair.class, true);
    }
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
      IntPair ip1 = (IntPair) w1;
      IntPair ip2 = (IntPair) w2;
      int cmp = IntPair.compare(ip1.getFirst(), ip2.getFirst());
      if (cmp != 0) {
        return cmp;
      }
      return -IntPair.compare(ip1.getSecond(), ip2.getSecond()); //reverse
    }
}

public static class GroupComparator extends WritableComparator {
    protected GroupComparator() {
      super(IntPair.class, true);
    }
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
      IntPair ip1 = (IntPair) w1;
      IntPair ip2 = (IntPair) w2;
      return IntPair.compare(ip1.getFirst(), ip2.getFirst());
    }
}

@Override
public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setPartitionerClass(FirstPartitioner.class);
    job.setSortComparatorClass(KeyComparator.class);
    job.setGroupingComparatorClass(GroupComparator.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(IntPair.class);
    job.setOutputValueClass(NullWritable.class);

    return job.waitForCompletion(true) ? 0 : 1;
}

public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MaxTemperatureUsingSecondarySort(), args);
    System.exit(exitCode);
}
}

Running the program returns the maximum temperature for each year:

   % hadoop jar hadoop-examples.jar MaxTemperatureUsingSecondarySort \
       input/ncdc/all output-secondarysort
   % hadoop fs -cat output-secondarysort/part-* | sort | head

1901 317
1902 244
1903 289
1904 256
1905 283
1906 294
1907 283
1908 289
1909 278
1910 294

3   Joins
--------------------------------------------------------------------------------------------------------------------------------------------------------
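MapReduce can perform joins between large datasets, but writing the code to do joins from scratch is fairly involved. Rather than writing MapReduce
programs, you might consider using a higher-level framework such as Pig, Cascading, Crunch, or Spark, in which join operations are a core part of
the implementation.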

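Let's briefly consider the problem we are trying to solve. We have two datasets (the weather stations database and the weather records) and we
want to reconcile the two. Let's say we want to see each station's history, with the station's metadata inlined in each output record.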

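How we implement the join depends on how large the datasets are and how they are partitioned. If one dataset is large (the weather records) but
the other one is small enough to be distributed to each node in the cluster (as the station metadata is), the join can be effected by a MapReduce
job that brings the records for each station together (a partial sort on station ID, for example). The mapper or reducer uses the smaller dataset
to look up the station metadata for a station ID, so it can be written out with each record.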

The join is called a map-side join if it is performed by the mapper, and a reduce-side join if it is performed by the reducer.

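If both datasets are too large for either to be copied to each node in the cluster, we can still join them using MapReduce with a map-side or
reduce-side join, depending on how the data is structured. One common example of this case is a user database and a log of some user activity
(such as access logs). For a popular service, it is not feasible to distribute the user database (or the logs) to all the MapReduce nodes.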

1.   Map-Side Joins
-------------------------------------------------------------------------------------------------------------------------------------------------------
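A map-side join between large inputs works by performing the join before the data reaches the map function. For this to work, though, the inputs
to each map must be partitioned and sorted in a particular way. Each input dataset must be divided into the same number of partitions, and it must
be sorted by the same key (the join key) in each source. All the records for a particular key must reside in the same partition. This may sound
like a strict requirement (and it is), but it actually fits the description of the output of a MapReduce job.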

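A map-side join can be used to join the outputs of several jobs that had the same number of reducers, the same keys, and output files that are not
splittable (by virtue of being smaller than an HDFS block or being gzip compressed, for example).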

You use a CompositeInputFormat from the org.apache.hadoop.mapreduce.join package to run a map-side join. The input sources and join type (inner or
outer) for CompositeInputFormat are configured through a join expression.

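The org.apache.hadoop.examples.Join example is a general-purpose command-line program for running a map-side join: it lets you run a MapReduce job
for any specified mapper and reducer over multiple inputs that are joined with a given join operation.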

2.   Reduce-Side Joins
-------------------------------------------------------------------------------------------------------------------------------------------------------
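A reduce-side join is more general than a map-side join, in that the input datasets do not have to be structured in any particular way, but it is
less efficient because both datasets have to go through the MapReduce shuffle.
The basic idea is that the mapper tags each record with its source and uses the join key as the map output key, so that the records with the same
key are brought together in the same reducer.
Several ingredients make this work in practice: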

   Multiple inputs
   ---------------------------------------------------------------------------------------------------------------------------------------------------
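   The input sources for the datasets generally have different formats, so it is convenient to use the MultipleInputs class to separate the logic
   for parsing and tagging each source.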


   Secondary sort
   ---------------------------------------------------------------------------------------------------------------------------------------------------
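   The reducer will see the records from both sources that have the same key, but they are not guaranteed to be in any particular order. However,
   to perform the join, it is important to have the data from one source before that from the other. For the weather data join, the station record
   must be the first of the values seen for each key, so the reducer can fill in the weather records with the station name and emit them
   straightaway.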

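To tag each record, we use TextPair for the keys (to store the station ID) and the tag. The only requirement for the tag values is that they sort
in such a way that the station records come before the weather records. This can be achieved by tagging station records as 0 and weather records
as 1.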

The mapper classes are as follows:

// Mapper for tagging station records for a reduce-side join
public class JoinStationMapper
    extends Mapper<LongWritable, Text, TextPair, Text> {
private NcdcStationMetadataParser parser = new NcdcStationMetadataParser();

@Override
protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (parser.parse(value)) {
      context.write(new TextPair(parser.getStationId(), "0"),
          new Text(parser.getStationName()));
    }
}
}

// Mapper for tagging weather records for a reduce-side join
public class JoinRecordMapper
    extends Mapper<LongWritable, Text, TextPair, Text> {
private NcdcRecordParser parser = new NcdcRecordParser();

@Override
protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    parser.parse(value);
    context.write(new TextPair(parser.getStationId(), "1"), value);
}

}

The reducer knows that it will receive the station record first, so it extracts the station name from the value and writes it out as part of every
output record.

// Reducer for joining tagged station records with tagged weather records
public class JoinReducer extends Reducer<TextPair, Text, Text, Text> {

@Override
protected void reduce(TextPair key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    Iterator<Text> iter = values.iterator();
    Text stationName = new Text(iter.next());
    while (iter.hasNext()) {
      Text record = iter.next();
      Text outValue = new Text(stationName.toString() + "\t" + record.toString());
      context.write(key.getFirst(), outValue);
    }
}
}

-------------------------------------------------------------------------------------------------------------------------------------------------------
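The code assumes that every station ID in the weather records has exactly one matching record in the station dataset. If this were not the case,
we would need to generalize the code to put the tag into the value objects, by using another TextPair. The reduce() method would then be able to
tell which entries were station names and detect (and handle) missing or duplicate entries before processing the weather records.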
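
One way such a generalized reduce step might look is sketched below in plain Java, without Hadoop types. Each value is modeled as a {tag, payload}
pair that has already been ordered so station entries (tag "0") come before weather records (tag "1"). The class name TaggedJoinSketch, the
"UNKNOWN" placeholder, and the policy of keeping the first station entry and ignoring duplicates are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Models a reduce step that carries the source tag in each value. */
final class TaggedJoinSketch {

    static final String STATION_TAG = "0";
    static final String UNKNOWN_STATION = "UNKNOWN"; // hypothetical placeholder name

    /** Joins one reduce group; each element of taggedValues is {tag, payload}. */
    static List<String> joinGroup(String stationId, String[][] taggedValues) {
        String stationName = null;
        List<String> output = new ArrayList<>();
        for (String[] value : taggedValues) {
            String tag = value[0];
            String payload = value[1];
            if (STATION_TAG.equals(tag)) {
                if (stationName == null) {
                    stationName = payload; // keep the first entry, ignore duplicates
                }
            } else {
                // A weather record with no preceding station entry gets the placeholder.
                String name = stationName != null ? stationName : UNKNOWN_STATION;
                output.add(stationId + "\t" + name + "\t" + payload);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        String[][] group = {{"0", "SIHCCAJAVRI"}, {"1", "record-a"}, {"1", "record-b"}};
        joinGroup("011990-99999", group).forEach(System.out::println);
    }
}
```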

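The driver class for the job is shown below. The essential point is that we partition and group on the first part of the key, the station ID,
which we do with a custom partitioner (KeyPartitioner) and a custom grouping comparator (FirstComparator, from TextPair).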

//Application to join weather records with station names
public class JoinRecordWithStationName extends Configured implements Tool {

public static class KeyPartitioner extends Partitioner<TextPair, Text> {
    @Override
    public int getPartition(TextPair key, Text value, int numPartitions) {
      return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

@Override
public int run(String[] args) throws Exception {
    if (args.length != 3) {
      JobBuilder.printUsage(this, "<ncdc input> <station input> <output>");
      return -1;
    }

    Job job = Job.getInstance(getConf(), "Join weather records with station names");
    job.setJarByClass(getClass());

    Path ncdcInputPath = new Path(args[0]);
    Path stationInputPath = new Path(args[1]);
    Path outputPath = new Path(args[2]);

    MultipleInputs.addInputPath(job, ncdcInputPath,
        TextInputFormat.class, JoinRecordMapper.class);
    MultipleInputs.addInputPath(job, stationInputPath,
        TextInputFormat.class, JoinStationMapper.class);
    FileOutputFormat.setOutputPath(job, outputPath);

    job.setPartitionerClass(KeyPartitioner.class);
    job.setGroupingComparatorClass(TextPair.FirstComparator.class);

    job.setMapOutputKeyClass(TextPair.class);

    job.setReducerClass(JoinReducer.class);

    job.setOutputKeyClass(Text.class);

    return job.waitForCompletion(true) ? 0 : 1;
}

public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new JoinRecordWithStationName(), args);
    System.exit(exitCode);
}
}

Running the program on the sample data yields the following output:

011990-99999 SIHCCAJAVRI 0067011990999991950051507004...
011990-99999 SIHCCAJAVRI 0043011990999991950051512004...
011990-99999 SIHCCAJAVRI 0043011990999991950051518004...
012650-99999 TYNSET-HANSMOEN 0043012650999991949032412004...
012650-99999 TYNSET-HANSMOEN 0043012650999991949032418004...

4   Side Data Distribution
-------------------------------------------------------------------------------------------------------------------------------------------------------
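Side data can be defined as extra read-only data needed by a job to process the main dataset. The challenge is to make side data available to all
the map or reduce tasks (which are spread across the cluster) in a convenient and efficient fashion.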

   1. Using the Job Configuration
   ----------------------------------------------------------------------------------------------------------------------------------------------------
   You can set arbitrary key-value pairs in the job configuration using the various setter methods on Configuration. This is very useful when you
   need to pass a small piece of metadata to your tasks.

   In the task, you can retrieve the data from the configuration returned by Context's getConfiguration() method.

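   Usually a primitive type is sufficient to encode your metadata, but for arbitrary objects you can either handle the serialization yourself (if
   you have an existing mechanism for turning objects to strings and back) or use Hadoop's Stringifier class. The DefaultStringifier uses Hadoop's
   serialization framework to serialize objects.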

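   You should not use this mechanism for transferring more than a few kilobytes of data, because it can put pressure on the memory usage in
   MapReduce components. The job configuration is always read by the client, the application master, and the task JVM, and each time the
   configuration is read, all of its entries are read into memory, even if they are not used.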


   2. Distributed Cache
   ----------------------------------------------------------------------------------------------------------------------------------------------------
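   Rather than serializing side data in the job configuration, it is preferable to distribute datasets using Hadoop's distributed cache mechanism.
   This provides a service for copying files and archives to the task nodes in time for the tasks to use them when they run. To save network
   bandwidth, files are normally copied to any particular node once per job.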


       Usage
       -----------------------------------------------------------------------------------------------------------------------------------------------
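       For tools that use GenericOptionsParser, you can specify the files to be distributed as a comma-separated list of URIs as the argument to
       the -files option. Files can be on the local filesystem, on HDFS, or on another Hadoop-readable filesystem (such as S3). If no scheme is
       supplied, then the files are assumed to be local.
       You can also copy archive files (JAR files, ZIP files, tar files, and gzipped tar files) to your tasks using the -archives option; these
       are unarchived on the task node.
       The -libjars option adds JAR files to the classpath of the mapper and reducer tasks. This is useful if you have not bundled library JAR
       files in your job JAR file.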

       Example command:
               % hadoop jar hadoop-examples.jar \
                   MaxTemperatureByStationNameUsingDistributedCacheFile \
                   -files input/ncdc/metadata/stations-fixed-width.txt input/ncdc/all output

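       This command copies the local file stations-fixed-width.txt (no scheme is supplied, so the path is automatically interpreted as a local
       file) to the task nodes, so it can be used to look up station names.
       The listing for MaxTemperatureByStationNameUsingDistributedCacheFile is as follows: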

//Application to find the maximum temperature by station, showing station
//names from a lookup table passed as a distributed cache file

public class MaxTemperatureByStationNameUsingDistributedCacheFile
extends Configured implements Tool {

static class StationTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

    private NcdcRecordParser parser = new NcdcRecordParser();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {

      parser.parse(value);
      if (parser.isValidTemperature()) {
        context.write(new Text(parser.getStationId()),
            new IntWritable(parser.getAirTemperature()));
      }
    }
}

static class MaxTemperatureReducerWithStationLookup
    extends Reducer<Text, IntWritable, Text, IntWritable> {

    private NcdcStationMetadata metadata;

    @Override
    protected void setup(Context context)
        throws IOException, InterruptedException {
      metadata = new NcdcStationMetadata();
      metadata.initialize(new File("stations-fixed-width.txt"));
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {

      String stationName = metadata.getStationName(key.toString());

      int maxValue = Integer.MIN_VALUE;
      for (IntWritable value : values) {
        maxValue = Math.max(maxValue, value.get());
      }
      context.write(new Text(stationName), new IntWritable(maxValue));
    }
}

@Override
public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(StationTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducerWithStationLookup.class);

    return job.waitForCompletion(true) ? 0 : 1;
}

public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(
        new MaxTemperatureByStationNameUsingDistributedCacheFile(), args);
    System.exit(exitCode);
}
}

           Tip:
           -----------------------------------------------------------------------------------------------------------------------------------------------
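           You can use the distributed cache for copying files that do not fit in memory. Hadoop map files are very useful in this regard, since
           they serve as an on-disk lookup format. Because map files are collections of files with a defined directory structure, you should put
           them into an archive format (JAR, ZIP, tar, or gzipped tar) and add them to the cache using the -archives option.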



       How It Works
       -----------------------------------------------------------------------------------------------------------------------------------------------
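       When you launch a job, Hadoop copies the files specified by the -files, -archives, and -libjars options to the distributed filesystem
       (normally HDFS). Then, before a task is run, the node manager copies the files from the distributed filesystem to a local disk (the cache)
       so the task can access the files. The files are said to be localized at this point. From the task's point of view, the files are just
       there, symbolically linked from the task's working directory. In addition, files specified by -libjars are added to the task's classpath
       before it is launched.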

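       The node manager also maintains a reference count for the number of tasks using each file in the cache. Before the task has run, the
       file's reference count is incremented by 1; then, after the task has run, the count is decreased by 1. Only when the file is not being used
       (when the count reaches zero) is it eligible for deletion. Files are deleted to make room for a new file when the node's cache exceeds a
       certain size (10 GB by default). The cache size may be changed by setting the following property: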

           yarn.nodemanager.localizer.cache.target-size-mb

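       Although this design does not guarantee that subsequent tasks from the same job running on the same node will find the file they need in
       the cache, it is very likely that they will: tasks from a job are usually scheduled to run at around the same time, so there is not the
       opportunity for enough other jobs to run to cause the original task's file to be deleted from the cache.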
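
       The reference-counting rule described above can be modeled in a few lines. This is only a toy illustration of the bookkeeping, not the
       node manager's implementation; CacheRefCountSketch and its method names are hypothetical, and the size-based eviction is left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of per-file reference counting in a local cache. */
final class CacheRefCountSketch {

    private final Map<String, Integer> refCounts = new HashMap<>();

    /** Called before a task that uses the file starts: count goes up by 1. */
    void taskStarted(String file) {
        refCounts.merge(file, 1, Integer::sum);
    }

    /** Called after a task that used the file finishes: count goes down by 1. */
    void taskFinished(String file) {
        Integer count = refCounts.get(file);
        if (count == null || count == 0) {
            throw new IllegalStateException("no running task is using " + file);
        }
        refCounts.put(file, count - 1);
    }

    /** A file may be deleted only when no task is using it. */
    boolean eligibleForDeletion(String file) {
        return refCounts.getOrDefault(file, 0) == 0;
    }

    public static void main(String[] args) {
        CacheRefCountSketch cache = new CacheRefCountSketch();
        cache.taskStarted("stations-fixed-width.txt");
        System.out.println(cache.eligibleForDeletion("stations-fixed-width.txt"));
        cache.taskFinished("stations-fixed-width.txt");
        System.out.println(cache.eligibleForDeletion("stations-fixed-width.txt"));
    }
}
```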


       The Distributed Cache API
       -----------------------------------------------------------------------------------------------------------------------------------------------
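       Most applications do not need to use the distributed cache API, because they can use the cache via GenericOptionsParser. However, if
       GenericOptionsParser is not being used, then the API in Job can be used to put objects into the distributed cache. Here are the pertinent
       methods in Job: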

           public void addCacheFile(URI uri)
           public void addCacheArchive(URI uri)
           public void setCacheFiles(URI[] files)
           public void setCacheArchives(URI[] archives)
           public void addFileToClassPath(Path file)
           public void addArchiveToClassPath(Path archive)

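       Two types of objects can be placed in the cache: files and archives. Files are left intact on the task node, whereas archives are
       unarchived on the task node. For each type of object, there are three methods:
           addCacheXXXX() method         : adds the file or archive to the distributed cache
           setCacheXXXXs() method        : sets the entire list of files or archives to be added to the cache in a single call
           addXXXXToClassPath() method   : adds the file or archive to the MapReduce task's classpath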


           Tip:
           -------------------------------------------------------------------------------------------------------------------------------------------
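           The URIs referenced in the add and set methods must be files in a shared filesystem that exist when the job is run. On the other hand,
           the filenames specified as a GenericOptionsParser option may refer to local files, in which case they get copied to the default shared
           filesystem (normally HDFS).

           This is the key difference between using the Java API directly and using GenericOptionsParser: the Java API does not copy the file
           specified in the add or set method to the shared filesystem, whereas GenericOptionsParser does.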


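       Retrieving distributed cache files from the task works in the same way as before: you access the localized file directly by name, because
       MapReduce always creates a symbolic link from the task's working directory to every file or archive added to the distributed cache.
       Archives are unarchived, so you can access the files in them using the nested path.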

*
*
*

5   MapReduce Library Classes
-------------------------------------------------------------------------------------------------------------------------------------------------------
Hadoop comes with a library of mappers and reducers for commonly used functions.

                       MapReduce library classes

   +=================================+======================================================================+
   | Class                           | Description                                                          |
   +---------------------------------+----------------------------------------------------------------------+
   | ChainMapper, ChainReducer       | Run a chain of mappers in a single mapper and a reducer              |
   |                                 | followed by a chain of mappers in a single reducer, respectively.    |
   |                                 | (Symbolically, M+RM*, where M is a mapper and R is a reducer.)       |
   |                                 | This can substantially reduce the amount of disk I/O incurred        |
   |                                 | compared to running multiple MapReduce jobs.                         |
   +---------------------------------+----------------------------------------------------------------------+
   | FieldSelectionMapper,           | A mapper and reducer that can select fields (like the Unix cut       |
   | FieldSelectionReducer           | command) from the input keys and values and emit them as             |
   |                                 | output keys and values.                                              |
   +---------------------------------+----------------------------------------------------------------------+
   | IntSumReducer, LongSumReducer   | Reducers that sum integer values to produce a total for every key.   |
   +---------------------------------+----------------------------------------------------------------------+
   | InverseMapper                   | A mapper that swaps keys and values.                                 |
   +---------------------------------+----------------------------------------------------------------------+
   | MultithreadedMapper (new API)   | A mapper (or map runner in the old API) that runs mappers            |
   |                                 | concurrently in separate threads. Useful for mappers that are not    |
   |                                 | CPU-bound.                                                           |
   +---------------------------------+----------------------------------------------------------------------+
   | TokenCounterMapper              | A mapper that tokenizes the input value into words (using Java’s     |
   |                                 | StringTokenizer) and emits each word along with a count of 1.        |
   +---------------------------------+----------------------------------------------------------------------+
   | RegexMapper                     | A mapper that finds matches of a regular expression in the input     |
   |                                 | value and emits the matches along with a count of 1.                 |
   +---------------------------------+----------------------------------------------------------------------+