Hadoop learning record — 2.8.2 documentation — MapReduce Tutorial

1. Overview

Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them and re-executing failed tasks.
Typically, the compute node and the storage node are the same node, that is, the MapReduce framework and HDFS run on the same set of nodes. This configuration allows the framework to efficiently schedule tasks on nodes where data is readily available, resulting in very high aggregate bandwidth across nodes.
The MapReduce framework consists of a single master ResourceManager, one slave NodeManager per cluster node, and one MRAppMaster per application.
At a minimum, the application specifies the input/output locations and supplies the map and reduce functions via implementations of the appropriate interfaces and/or abstract classes. These, and other job parameters, make up the job configuration.
The Hadoop job client then submits the job (jar/executable, etc.) and configuration to the ResourceManager, which takes on the responsibility of distributing the software/configuration to the slaves, scheduling and monitoring the tasks, and providing status and diagnostic information to the job client.
Although the Hadoop framework is implemented in Java, MapReduce applications need not be written in Java.
- Hadoop Streaming is a utility that allows users to create and run jobs with any executables (such as shell utilities) as the mapper and/or the reducer.
- Hadoop Pipes is a SWIG-compatible C++ API for implementing MapReduce applications (not based on JNI).
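For instance, a streaming job can plug ordinary shell utilities in as the mapper and reducer. A minimal sketch, assuming the streaming jar sits at the standard location of a 2.8.2 binary tarball and that the HDFS paths below are placeholders:
$ bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.8.2.jar \
    -input /user/joe/streaming/input \
    -output /user/joe/streaming/output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc
Here /bin/cat acts as an identity mapper and /usr/bin/wc as the reducer; the -input, -output, -mapper and -reducer options name the job's input/output directories and the executables to run.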

2. inputs and outputs

The MapReduce framework operates exclusively on <key, value> pairs: the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
Input and output types of a MapReduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

3. wordcount v1.0

WordCount is a simple application that counts the number of occurrences of each word in a given input set.

1. Source code

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}  

2. Usage

Set the environment variables as follows:
export JAVA_HOME=/usr/java/default
export PATH=${JAVA_HOME}/bin:${PATH}
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
Compile WordCount.java and create a jar file:
$ bin/hadoop com.sun.tools.javac.Main WordCount.java
$ jar cf wc.jar WordCount*.class
Assuming that:
 - /user/joe/wordcount/input - input directory in HDFS
 - /user/joe/wordcount/output - output directory in HDFS
Sample text files as input:
$ bin/hadoop fs -ls /user/joe/wordcount/input/
/user/joe/wordcount/input/file01
/user/joe/wordcount/input/file02

$ bin/hadoop fs -cat /user/joe/wordcount/input/file01
Hello World Bye World

$ bin/hadoop fs -cat /user/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop
Run the application:
$ bin/hadoop jar wc.jar WordCount /user/joe/wordcount/input /user/joe/wordcount/output
Test results:
[hadoop@hadoop ~]$ hadoop jar wc.jar mvn.mvnsample.WordCount wordcount/input wordcount/output
17/11/02 12:18:10 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/11/02 12:18:53 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/11/02 12:18:55 INFO input.FileInputFormat: Total input paths to process : 2
17/11/02 12:18:56 INFO mapreduce.JobSubmitter: number of splits:2
17/11/02 12:18:57 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1509592542822_0002
17/11/02 12:19:49 INFO impl.YarnClientImpl: Submitted application application_1509592542822_0002
17/11/02 12:21:20 INFO mapreduce.Job: The url to track the job: http://hadoop:8088/proxy/application_1509592542822_0002/
17/11/02 12:21:20 INFO mapreduce.Job: Running job: job_1509592542822_0002
17/11/02 12:22:52 INFO mapreduce.Job: Job job_1509592542822_0002 running in uber mode : false
17/11/02 12:22:53 INFO mapreduce.Job:  map 0% reduce 0%
17/11/02 12:24:32 INFO mapreduce.Job:  map 100% reduce 0%
17/11/02 12:26:05 INFO mapreduce.Job:  map 100% reduce 100%
17/11/02 12:27:37 INFO mapreduce.Job: Job job_1509592542822_0002 completed successfully
17/11/02 12:27:39 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=79
                FILE: Number of bytes written=362146
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=292
                HDFS: Number of bytes written=41
                HDFS: Number of read operations=9
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=2
                Launched reduce tasks=1
                Data-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=203474
                Total time spent by all reduces in occupied slots (ms)=87282
                Total time spent by all map tasks (ms)=203474
                Total time spent by all reduce tasks (ms)=87282
                Total vcore-milliseconds taken by all map tasks=203474
                Total vcore-milliseconds taken by all reduce tasks=87282
                Total megabyte-milliseconds taken by all map tasks=208357376
                Total megabyte-milliseconds taken by all reduce tasks=89376768
        Map-Reduce Framework
                Map input records=2
                Map output records=8
                Map output bytes=82
                Map output materialized bytes=85
                Input split bytes=242
                Combine input records=8
                Combine output records=6
                Reduce input groups=5
                Reduce shuffle bytes=85
                Reduce input records=6
                Reduce output records=5
                Spilled Records=12
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=1551
                CPU time spent (ms)=11070
                Physical memory (bytes) snapshot=644284416
                Virtual memory (bytes) snapshot=6374944768
                Total committed heap usage (bytes)=449314816
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=50
        File Output Format Counters
                Bytes Written=41
Output:
$ bin/hadoop fs -cat /user/joe/wordcount/output/part-r-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
Applications can specify a comma-separated list of paths which will be present in the current working directory of the task, using the -files option. The -libjars option allows applications to add jars to the classpaths of the maps and reduces. The -archives option allows them to pass a comma-separated list of archives as arguments; these archives are unarchived, and a link with the name of the archive is created in the current working directory of tasks.
Running the wordcount example with -libjars, -files and -archives:
bin/hadoop jar hadoop-mapreduce-examples-<ver>.jar wordcount -files cachefile.txt -libjars mylib.jar -archives myarchive.zip input output
Here, myarchive.zip will be placed and unzipped into a directory named myarchive.zip.
Users can specify a different symbolic name for files and archives passed through the -files and -archives options by using #. For example:
bin/hadoop jar hadoop-mapreduce-examples-<ver>.jar wordcount -files dir1/dict.txt#dict1,dir2/dict.txt#dict2 -archives mytar.tgz#tgzdir input output
Here the files dir1/dict.txt and dir2/dict.txt can be accessed by tasks using the symbolic names dict1 and dict2 respectively. The archive mytar.tgz will be placed and unarchived into a directory named tgzdir. A sketch of how a task might read one of these symlinked files follows.
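Inside a task, these symbolic names simply appear as entries in the task's current working directory, so they can be opened with ordinary Java I/O. Below is a minimal sketch of a mapper's setup method reading the dict1 symlink; the class name DictAwareMapper and the one-word-per-line dictionary format are assumptions made only for illustration.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DictAwareMapper extends Mapper<Object, Text, Text, IntWritable> {

  private final Set<String> dictionary = new HashSet<String>();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // "dict1" is the symbolic name given via -files dir1/dict.txt#dict1;
    // the framework creates a symlink with that name in the task's working directory.
    BufferedReader reader = new BufferedReader(new FileReader("dict1"));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        dictionary.add(line.trim());
      }
    } finally {
      reader.close();
    }
  }

  // map(...) would then consult the dictionary while processing each record.
}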

3. walk-through

The WordCount application is quite straightforward:
public void map(Object key, Text value, Context context
                ) throws IOException, InterruptedException {
  StringTokenizer itr = new StringTokenizer(value.toString());
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    context.write(word, one);
  }
}
The Mapper implementation, via its map method, processes one line at a time, as provided by the specified TextInputFormat. It then splits the line into tokens separated by whitespace via the StringTokenizer, and emits a key-value pair of < <word>, 1>.

For the given sample input, the first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
We'll learn more about the number of maps spawned for a given job, and how to control them in a fine-grained manner, a bit later.
job.setCombinerClass(IntSumReducer.class);
WordCount also specifies a combiner. Hence, the output of each map is passed through the local combiner (which, per the job configuration, is the same as the Reducer) for local aggregation, after being sorted on the keys.
The output of the first map:
 < Bye, 1>
< Hello, 1>
< World, 2>
The output of the second map:
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>
public void reduce(Text key, Iterable<IntWritable> values,
                   Context context
                   ) throws IOException, InterruptedException {
  int sum = 0;
  for (IntWritable val : values) {
    sum += val.get();
  }
  result.set(sum);
  context.write(key, result);
}
The Reducer implementation, via the reduce method, just sums up the values, which are the occurrence counts for each key (i.e. the words in this example).
Thus the output of the job is:
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
The main method specifies various facets of the job, such as the input/output paths (passed via the command line), key/value types, input/output formats and so on. It then calls job.waitForCompletion to submit the job and monitor its progress.
We'll learn more about Job, InputFormat, OutputFormat and other interfaces and classes a bit later.

4. mapreduce user interface

This section provides reasonable details about each user-facing aspect of the MapReduce framework. This will help users implement, configure and debug their jobs in a coherent manner. However, please note that the javadoc for each class and interface is still the most comprehensive documentation available. This section is just a guide.
Let us first take the Mapper and Reducer interfaces. Applications typically implement them to provide the map and reduce methods.
We next discuss other core interfaces including Job, Partitioner, InputFormat, OutputFormat and more.
Finally, we conclude by discussing some useful features in the framework, such as DistributedCache, IsolationRunner, etc.

1. payload

Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods. These form the core of the job.

1. mapper

The Mapper maps input key/value pairs to a set of intermediate key/value pairs.
Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.
The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
Overall, Mapper implementations are passed to the job via the Job.setMapperClass(Class) method. The framework then calls the map(WritableComparable, Writable, Context) method for each key/value pair in the InputSplit for that task. Applications can then override the cleanup(Context) method to perform any required cleanup.
Output pairs do not need to be of the same types as input pairs. A given input pair may map to zero or many output pairs. Output pairs are collected with calls to context.write(WritableComparable, Writable).
Applications can use the Counter to report their statistics.
All intermediate values associated with a given output key are subsequently grouped by the framework and passed to the Reducer(s) to determine the final output. Users can control the grouping by specifying a Comparator via Job.setGroupingComparatorClass(Class).
The Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of reduce tasks for the job. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.
The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value) format. Applications can control whether, and how, the intermediate outputs are compressed, and which CompressionCodec is used, via the Configuration.
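As an illustration of that compression knob, the WordCount driver could enable map-output compression right after creating its Configuration, using the standard mapreduce.map.output.* properties (a sketch only; whether compression pays off depends on the data and the cluster):

// additional imports: org.apache.hadoop.io.compress.CompressionCodec,
//                     org.apache.hadoop.io.compress.GzipCodec
conf.setBoolean("mapreduce.map.output.compress", true);   // compress map output before the shuffle
conf.setClass("mapreduce.map.output.compress.codec",      // built-in gzip codec; Snappy etc. need native libraries
              GzipCodec.class, CompressionCodec.class);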
1. how many maps
The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set up to 300 maps for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.
Thus, if you expect 10TB of input data and have a blocksize of 128MB, you'll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
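As a sketch, that hint could be set in the driver like this (MRJobConfig.NUM_MAPS resolves to mapreduce.job.maps; the InputFormat remains free to compute a different number of splits):

// additional import: org.apache.hadoop.mapreduce.MRJobConfig
conf.setInt(MRJobConfig.NUM_MAPS, 100000);   // only a hint to the framework
// lowering the maximum split size is another way to raise the number of maps:
// FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);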

2. reducer

The Reducer reduces a set of intermediate values which share a key to a smaller set of values.
The number of reducers in the job is set by Job.setNumReduceTasks(int).
In general, Reducer implementations are passed to the job via the Job.setReducerClass(Class) method and can override it to initialize themselves. The framework then calls the reduce(WritableComparable, Iterable<Writable>, Context) method for each <key, (list of values)> pair in the grouped inputs. Applications can then override the cleanup(Context) method to perform any required cleanup.
The reducer has three main phases: shuffle, sort and reduce.
1. shuffle
The input of the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers via HTTP.
2. sort
At this stage, the framework groups the Reducer inputs by keys (since different mappers may have output the same key).
The shuffle and sort phases occur simultaneously; while map outputs are being fetched, they are merged.
3. Secondary sort
If the equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, then one may specify a Comparator via Job.setSortComparatorClass(Class). Since Job.setGroupingComparatorClass(Class) can be used to control how intermediate keys are grouped, the two can be used in conjunction to simulate a secondary sort on values; see the sketch after this list.
4. reduce
In this phase the reduce(WritableComparable, Iterable<Writable>, Context) method is called for each <key, (list of values)> pair in the grouped inputs. The output of the reduce task is typically written to the file system via Context.write(WritableComparable, Writable), and the output of the Reducer is not sorted.
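The secondary-sort wiring mentioned in point 3 can be sketched as follows. The composite key PairKey and the two comparators FullKeyComparator and FirstFieldGroupingComparator are hypothetical names used only for illustration; the essential idea is that the sort comparator orders on the whole key while the grouping comparator groups on the first field only.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Composite key: a "natural" first field plus a secondary field to sort on.
public class PairKey implements WritableComparable<PairKey> {
  private Text first = new Text();
  private IntWritable second = new IntWritable();

  public Text getFirst() { return first; }
  public void set(String f, int s) { first.set(f); second.set(s); }

  public void write(DataOutput out) throws IOException {
    first.write(out);
    second.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    first.readFields(in);
    second.readFields(in);
  }

  public int compareTo(PairKey other) {
    int cmp = first.compareTo(other.first);
    return cmp != 0 ? cmp : second.compareTo(other.second);
  }

  // Hash on the first field only, so the default HashPartitioner sends all
  // records with the same natural key to the same reduce task.
  public int hashCode() { return first.hashCode(); }
}

// Sort comparator: orders by the full composite key (first field, then second).
class FullKeyComparator extends WritableComparator {
  protected FullKeyComparator() { super(PairKey.class, true); }
  public int compare(WritableComparable a, WritableComparable b) {
    return ((PairKey) a).compareTo((PairKey) b);
  }
}

// Grouping comparator: groups reducer input by the first field only, so all
// values for one natural key arrive in a single reduce() call, already sorted
// by the second field.
class FirstFieldGroupingComparator extends WritableComparator {
  protected FirstFieldGroupingComparator() { super(PairKey.class, true); }
  public int compare(WritableComparable a, WritableComparable b) {
    return ((PairKey) a).getFirst().compareTo(((PairKey) b).getFirst());
  }
}

In the driver these would be wired up with job.setMapOutputKeyClass(PairKey.class), job.setSortComparatorClass(FullKeyComparator.class) and job.setGroupingComparatorClass(FirstFieldGroupingComparator.class); keeping equal natural keys in one partition is handled here by basing hashCode on the first field, so the default HashPartitioner routes them to the same reducer.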

3. partitioner

The partitioner partitions the key space.
The Partitioner controls the partitioning of the keys of the intermediate map outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. This therefore controls which of the m reduce tasks the intermediate key (and hence the record) is sent to for reduction.
HashPartitioner is the default partitioner.
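A minimal custom Partitioner, sketched here with a made-up class name, reproduces the hash behaviour and is the place to plug in a routing rule of your own (for instance, by key prefix):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask the sign bit so the modulo result is never negative,
    // then spread the keys across the reduce tasks by hash.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be registered in the driver with job.setPartitionerClass(WordPartitioner.class); the numPartitions it receives equals the number of reduce tasks set via Job.setNumReduceTasks(int).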

4. counter

Counter is a facility for MapReduce applications to report their statistics.
Mapper and Reducer implementations can use the Counter to report statistics.
Hadoop MapReduce comes bundled with a library of generally useful mappers, reducers and partitioners.
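For example, the WordCount mapper shown earlier could report how many tokens it emits through a user-defined counter; the group and counter names below ("WordCount", "TOKENS_EMITTED") are arbitrary choices for illustration:

public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
  StringTokenizer itr = new StringTokenizer(value.toString());
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    context.write(word, one);
    // the counter appears in the job's counter listing next to the
    // built-in File System / Map-Reduce Framework groups
    context.getCounter("WordCount", "TOKENS_EMITTED").increment(1);
  }
}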

2. job configuration

Job represents a MapReduce job configuration.
Job is the primary interface for a user to describe a MapReduce job to the Hadoop framework for execution. The framework tries to faithfully execute the job as described, however:
- Some configuration parameters are marked as final by the administrator and therefore cannot be changed.
- While some job parameters (such as Job.setNumReduceTasks(int)) can be set directly, other parameters subtly interact with other parts of the framework and job configuration, and are difficult to set (such as Configuration.set(JobContext.NUM_MAPS, int )).
The Job is typically used to specify the Mapper, combiner (if any), Partitioner, Reducer, InputFormat and OutputFormat implementations. FileInputFormat indicates the set of input files (FileInputFormat.setInputPaths(Job, Path…) / FileInputFormat.addInputPath(Job, Path) and FileInputFormat.setInputPaths(Job, String…) / FileInputFormat.addInputPaths(Job, String)), and FileOutputFormat indicates where the output files should be written (FileOutputFormat.setOutputPath(Path)).
Sometimes the Job is also used to specify other advanced facets of the job such as the Comparator to be used, files to be put in the DistributedCache, whether intermediate and/or job outputs are to be compressed (and how), whether job tasks can be executed in a speculative manner (setMapSpeculativeExecution(boolean) / setReduceSpeculativeExecution(boolean)), the maximum number of attempts per task (setMaxMapAttempts(int) / setMaxReduceAttempts(int)), etc.
Of course, users can use Configuration.set(String, String) / Configuration.get(String) to set/get arbitrary parameters needed by their applications. However, use the DistributedCache for large amounts of (read-only) data.
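As a sketch of that last point, an application-specific switch can be pushed through the configuration in the driver and read back inside a task; the property name wordcount.case.sensitive below is just an illustrative example:

// in the driver, before Job.getInstance(conf, ...)
conf.setBoolean("wordcount.case.sensitive", false);

// in the mapper
private boolean caseSensitive;

@Override
protected void setup(Context context) throws IOException, InterruptedException {
  // the job configuration is available to every task via the context
  caseSensitive = context.getConfiguration().getBoolean("wordcount.case.sensitive", true);
}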

3. Task Execution & Environment

The MRAppMaster executes the Mapper/Reducer task as a child process in a separate JVM.
The child task inherits the environment of the parent MRAppMaster. The user can specify additional options for the child JVM via mapreduce.{map|reduce}.java.opts, and configuration parameters in the job such as -Djava.library.path=<> to add non-standard paths for the run-time linker to search for shared libraries. If the mapreduce.{map|reduce}.java.opts parameter contains the symbol @taskid@, it is replaced with the value of the taskid of the MapReduce task.
Here is an example, with multiple arguments and substitutions, showing JVM GC logging and the start of a passwordless JVM JMX agent so that one can connect with jconsole and the like to watch child memory and threads and get thread dumps. It also sets the maximum heap size of the map and reduce child JVMs to 512MB and 1024MB respectively, and adds an additional path to the java.library.path of the child JVM.

        <property>
          <name>mapreduce.map.java.opts</name>
          <value>
         -Xmx512M -Djava.library.path=/home/mycompany/lib -verbose:gc -Xloggc:/tmp/@taskid@.gc
          -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
          </value>
        </property>

        <property>
         <name>mapreduce.reduce.java.opts</name>
          <value>
          -Xmx1024M -Djava.library.path=/home/mycompany/lib -verbose:gc -Xloggc:/tmp/@taskid@.gc
          -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
          </value>
        </property>

memory management

Users and administrators can specify the maximum virtual memory of the launched child task, and of any sub-process it launches recursively, through mapreduce.{map|reduce}.memory.mb. Note that the value set here is a per-process limit. The value of mapreduce.{map|reduce}.memory.mb is specified in megabytes (MB), and it must be greater than or equal to the -Xmx passed to the JavaVM, otherwise the VM might fail to start.
Note: mapreduce.{map|reduce}.java.opts is used only to configure the child tasks launched from the MRAppMaster. Configuring memory options for the daemons is described in "Configuring the Environment of the Hadoop Daemons".
The memory available to some parts of the framework is also configurable. In map and reduce tasks, performance may be influenced by adjusting parameters that affect the concurrency of operations and the frequency with which data hits disk. Monitoring the filesystem counters for a job, in particular the relative number of bytes from the map into the reduce, is invaluable when tuning these parameters.
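A sketch of setting the container limit and the child heap consistently from the driver (values are illustrative; the -Xmx must stay below the corresponding mapreduce.*.memory.mb limit, as noted above):

conf.set("mapreduce.map.memory.mb", "2048");         // per-process limit for map tasks, in MB
conf.set("mapreduce.map.java.opts", "-Xmx1638m");    // child heap kept below the 2048 MB limit
conf.set("mapreduce.reduce.memory.mb", "4096");      // per-process limit for reduce tasks
conf.set("mapreduce.reduce.java.opts", "-Xmx3276m"); // child heap kept below the 4096 MB limit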

map parameters


shuffle/reduce parameters

configured parameters

task logs

distributing library

4. Job Submission and Monitoring

job control

5. job input

inputsplit

recordreader

6. job output

7. other useful features

Example: wordcount v2.0


Note: when displaying <word> in this post, a space is needed between the word and the angle brackets, otherwise it is not rendered.
