WordCount Example in Hadoop 1.0 - Starting a Hadoop Job

Let's learn how to use Hadoop for distributed computing through the official WordCount example.

1. Inputs and Outputs

Every program needs input and output. The Hadoop MapReduce framework operates exclusively on <Key, Value> pairs: both the input and the output of a job are <Key, Value> pairs, and of course the keys and values can be of various types.

Hadoop serializes both keys and values. Hadoop's default serialization mechanism requires the key and value classes to implement the Writable interface. In addition, keys must be sortable, so a key class must implement the WritableComparable interface.
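To make that contract concrete, here is a pure-JDK sketch (no Hadoop dependency) of what a key type must provide: a write method, a readFields method, and an ordering. In real Hadoop code the class would implement org.apache.hadoop.io.WritableComparable instead of defining these methods ad hoc; the method shapes below only mirror that interface.

```java
import java.io.*;

// Sketch of the Writable/WritableComparable contract using only the JDK.
// A real Hadoop key would implement org.apache.hadoop.io.WritableComparable<T>.
public class WordKey implements Comparable<WordKey> {
    private String word = "";

    public WordKey() {}                          // Hadoop requires a no-arg constructor
    public WordKey(String word) { this.word = word; }
    public String get() { return word; }

    // Corresponds to Writable.write(DataOutput)
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
    }

    // Corresponds to Writable.readFields(DataInput)
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
    }

    // Keys must sort: corresponds to WritableComparable.compareTo
    @Override
    public int compareTo(WordKey other) {
        return word.compareTo(other.word);
    }

    // Round-trip demo: serialize, then deserialize into a fresh instance.
    public static WordKey roundTrip(WordKey k) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        k.write(new DataOutputStream(buf));
        WordKey copy = new WordKey();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        return copy;
    }
}
```

The no-arg constructor matters: Hadoop instantiates key/value objects reflectively and then fills them in via readFields.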

Here is the basic data flow of a MapReduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine* -> <k2, v2> -> reduce -> <k3, v3> (output)

Note that the combine phase may run zero times, or may be repeated multiple times.
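The pipeline above can be simulated in plain Java, without Hadoop. The sketch below runs word count with the combine step applied either zero times or once per "mapper"; because the combine function here is the same as reduce (summing counts), the final result is identical either way, which is exactly why Hadoop is free to run the combiner any number of times.

```java
import java.util.*;

// Plain-Java simulation of (input) -> map -> combine* -> reduce -> (output)
// for word count. This illustrates the data flow only, not the Hadoop API.
public class MapReduceSim {

    // map: one line in, a list of <word, 1> pairs out
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.trim().split("\\s+")) {
            if (!w.isEmpty()) out.add(Map.entry(w, 1));
        }
        return out;
    }

    // combine/reduce: sum the counts for each key
    static Map<String, Integer> sum(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            out.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return out;
    }

    // Run the job, optionally applying the combine step per "mapper" (per line).
    static Map<String, Integer> run(List<String> lines, boolean useCombiner) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (String line : lines) {
            List<Map.Entry<String, Integer>> mapped = map(line);
            if (useCombiner) {
                // local aggregation before the shuffle
                sum(mapped).forEach((k, v) -> shuffled.add(Map.entry(k, v)));
            } else {
                shuffled.addAll(mapped);
            }
        }
        return sum(shuffled);  // the reduce phase
    }
}
```

Because combining is just an early, partial reduce, it only works when the reduce function is associative and commutative, as summation is.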

2. Example: Word Count 1.0

WordCount, as the name suggests, counts the number of occurrences of each word in the input.

First, let's look at the following two methods implemented in the WordCount class:

public int run(String [] args) throws Exception {
    Job job = new Job(getConf());
    job.setJarByClass(WordCount.class);
    job.setJobName("wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setCombinerClass(Reduce.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    boolean success = job.waitForCompletion(true);
    return success ? 0 : 1;
}

public static void main(String[] args) throws Exception {
    int ret = ToolRunner.run(new WordCount(), args);
    System.exit(ret);
}

The main function is the entry point; ToolRunner.run(new WordCount(), args) starts the MapReduce job. WordCount is where the programmer implements the MapReduce logic. ToolRunner.run(new WordCount(), args) is equivalent to calling run(args) on the WordCount instance.

Now let's walk through the run method:

Job job = new Job(getConf());
job.setJarByClass(WordCount.class);
job.setJobName("wordcount");

First, the job is initialized from the Configuration. You may be wondering where getConf() is implemented: the method is inherited from the Configured base class. Next, setJarByClass determines which jar file to ship with the job by looking up the given class. Finally, the MapReduce job is given a name.
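The control flow behind this can be sketched in plain Java. The classes below are simplified stand-ins for Hadoop's Configured, Tool, and ToolRunner (the Mini* names are invented for this sketch); they show why getConf() is available inside run(): the runner injects the configuration into the inherited base class before delegating to run(args).

```java
import java.util.*;

// Simplified stand-ins for Hadoop's Configured / Tool / ToolRunner.
class MiniConfiguration {
    private final Map<String, String> props = new HashMap<>();
    public void set(String k, String v) { props.put(k, v); }
    public String get(String k) { return props.get(k); }
}

abstract class MiniConfigured {
    private MiniConfiguration conf;
    public void setConf(MiniConfiguration c) { this.conf = c; }
    public MiniConfiguration getConf() { return conf; }   // inherited, like Configured.getConf()
    public abstract int run(String[] args) throws Exception;
}

class MiniToolRunner {
    // Equivalent control flow to ToolRunner.run(tool, args):
    // inject the configuration, then call the tool's run(args).
    static int run(MiniConfigured tool, String[] args) throws Exception {
        tool.setConf(new MiniConfiguration());
        return tool.run(args);
    }
}

public class MiniWordCount extends MiniConfigured {
    @Override
    public int run(String[] args) throws Exception {
        getConf().set("job.name", "wordcount");   // conf is already injected by the runner
        return args.length == 2 ? 0 : 1;          // expect an input path and an output path
    }
}
```

The real ToolRunner also parses generic Hadoop command-line options (such as -D properties) into the Configuration before calling run, which is the main reason to use it instead of invoking run directly.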

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

These two lines declare the types of the output <Key, Value> pair: Text corresponds to String, and IntWritable to int.

job.setMapperClass(Map.class);
job.setCombinerClass(Reduce.class);
job.setReducerClass(Reduce.class);

These lines specify the mapper and reducer classes. The combiner performs local aggregation on the mapper's intermediate output, which reduces the amount of data transferred between the mappers and reducers.

job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

These lines set the input and output formats. With TextInputFormat.class, each line of the input file becomes one record passed to the map function; the WordCount mapper then splits the line on whitespace by default.
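Here is that per-record step in isolation, in plain Java: TextInputFormat hands the mapper one line at a time, and the classic WordCount mapper breaks it into words with java.util.StringTokenizer, whose default delimiters are whitespace characters.

```java
import java.util.*;

// One map() call in isolation: take the line that TextInputFormat produced
// and split it into words using StringTokenizer's default whitespace
// delimiters (space, tab, newline, carriage return, form feed).
public class LineTokenizer {
    static List<String> tokens(String line) {
        List<String> words = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            words.add(itr.nextToken());
        }
        return words;
    }
}
```

In the real mapper, each token would be emitted as a <word, 1> pair via the output collector rather than returned as a list.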

FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

These lines set the input and output file paths.

boolean success = job.waitForCompletion(true);

job.waitForCompletion(true) submits the job to the JobTracker and monitors it, returning only after the job completes.

Reposted from: https://www.cnblogs.com/licheng/archive/2011/11/08/2241721.html


Origin blog.csdn.net/weixin_33869377/article/details/92627513