MapReduce Programming Example: Word Count

This section describes how to write a basic MapReduce program for data analysis. The code in this section was developed against Hadoop 2.7.3.

Task Preparation

The word count (WordCount) task counts the occurrences of each word in a set of input documents. Assuming the files are large and each document contains many words, the data cannot be processed by a traditional single-machine, sequential program; this is exactly the kind of problem on which MapReduce can bring its strengths to bear.

The earlier tutorial "MapReduce Example Analysis: Word Count" already presented the basic idea and the execution process of word counting with MapReduce. The following describes how to write the concrete implementation code and how to run the program.

First, create three local files, file001, file002, and file003; their contents are shown in Table 1.

Table 1 Word count input files

File name    Document content
file001      Hello world
             Connected world
file002      One world
             One dream
file003      Hello Hadoop
             Hello Map
             Hello Reduce

Then create an input directory on HDFS with the HDFS shell command.

hadoop fs -mkdir input

Then upload file001, file002, and file003 to the input directory on HDFS.

hadoop fs -put file001 input
hadoop fs -put file002 input
hadoop fs -put file003 input

The first task in writing a MapReduce program is to write the Map program. In the word count task, the Map program needs to split the input text data into words and then output each word as a key-value pair.

Writing the Map Program

The Hadoop MapReduce framework already implements the basic functionality of a Map task in the Mapper class. To implement a Map task, a developer only needs to inherit from the Mapper class and implement its map function.

To implement the word count Map task, first set the input and output types of the Mapper class. Here, the input to the map function takes the <key, value> form, where key is the position of a line in the input file and value is the content of that line. Therefore, the input type of the map function is <LongWritable, Text>.

The map function does the work of splitting the text into words; its output also takes the <key, value> form, where key is a word and value is the number of times the word occurs. Therefore, the output type of the map function is <Text, IntWritable>.

The following is the Map task implementation code of the word count program.

public static class CoreMapper extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);
    private static Text label = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the current line of text into words
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            label.set(tokenizer.nextToken());
            context.write(label, one);   // emit <word, 1>
        }
    }
}

In the above code, the CoreMapper class implements the Map task. The two variables one and label are initialized first.

  • The variable one is set directly to the initial value 1, indicating that a word has occurred once in the text.
  • The first two parameters of the map function are its inputs: value, of type Text, holds the line of text being read, and key, of type Object, refers to the position of that line in the input text.

The StringTokenizer class splits the line of text held in the variable value into words and stores them in tokenizer. The program then loops over each word, puts it into label, and counts it as one occurrence.

Throughout the execution of the function, the value of one is always 1. In this example, key is not used in any significant way. context is the output of the map function; the intermediate results are written directly into it.
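
If you have not used StringTokenizer before, the following standalone sketch (not part of the WordCount job; the class name is chosen here only for illustration) shows how it splits one input line into the tokens that map() then emits with a count of 1.

import java.util.StringTokenizer;

public class TokenizeDemo {
    public static void main(String[] args) {
        // Split a sample line the same way map() splits value.toString()
        StringTokenizer tokenizer = new StringTokenizer("Hello world");
        while (tokenizer.hasMoreTokens()) {
            // Prints "Hello" then "world"; map() writes each out as <word, 1>
            System.out.println(tokenizer.nextToken());
        }
    }
}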

With the above code, the outputs of the Map tasks for the three files are shown in Table 2.

Table 2 Word count Map task outputs

File name / Map task    Map task output
file001 / Map1          <“Hello”,1> <“world”,1> <“Connected”,1> <“world”,1>
file002 / Map2          <“One”,1> <“world”,1> <“One”,1> <“dream”,1>
file003 / Map3          <“Hello”,1> <“Hadoop”,1> <“Hello”,1> <“Map”,1> <“Hello”,1> <“Reduce”,1>

Writing the Reduce Program

The second task in writing a MapReduce program is to write the Reduce program. In the word count task, what the Reduce task needs to do is sum the sequence of numbers in its input to obtain the number of occurrences of each word.

After the map function completes, the job enters the Shuffle stage. In this stage, the MapReduce framework automatically sorts and partitions the output of the Map stage and then distributes it to the appropriate Reduce tasks for processing. The Map-side results after the Shuffle stage are shown in Table 3.

Table 3 Word count Map-side Shuffle stage output

File name / Map task    Map-side Shuffle stage output
file001 / Map1          <“Connected”,1> <“Hello”,1> <“world”,<1,1>>
file002 / Map2          <“dream”,1> <“One”,<1,1>> <“world”,1>
file003 / Map3          <“Hadoop”,1> <“Hello”,<1,1,1>> <“Map”,1> <“Reduce”,1>
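
Note that in Table 3 the values for the same key are already grouped on the Map side, for example <“world”,<1,1>> for file001. If you want each Map task to go further and pre-sum its local counts before they are sent across the network, Hadoop lets you register a combiner. Reusing CoreReducer for this purpose is a common optional optimization for word count; it is not part of the original listings in this section, and a minimal sketch of the extra configuration line in main would be:

// Optional (not in the original code): reuse CoreReducer as a combiner so each
// Map task pre-sums its local <word, 1> pairs before the Shuffle stage.
job.setCombinerClass(CoreReducer.class);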

After the Reduce side receives the data sent from each Map side, key-value pairs that have the same key, that is, the same word, are merged into the form <key, <V1, V2, ..., Vn>>. The Reduce-side result after the Shuffle stage is shown in Table 4.

Table 4 Word count Reduce-side Shuffle stage output

Reduce-side Shuffle stage output:
<“Connected”,1>
<“dream”,1>
<“Hadoop”,1>
<“Hello”,<1,1,1,1>>
<“Map”,1>
<“One”,<1,1>>
<“world”,<1,1,1>>
<“Reduce”,1>

The Reduce stage needs to process this data to obtain the number of occurrences of each word. As can be seen from the input of the reduce function, all the reduce function needs to do is sum the numbers in the value sequence of its input. The following is the Reduce task implementation code of the word count program.

public static class CoreReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable count = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // Accumulate the counts of the current word
        for (IntWritable intWritable : values) {
            sum += intWritable.get();
        }
        count.set(sum);
        context.write(key, count);   // emit <word, total count>
    }
}

Similar to the Map task, the Reduce task inherits from the Reducer class provided by Hadoop and implements its interface. The input type of the reduce function is essentially the same as the output type of the map function, and so is its output type.

At the beginning of the reduce function, the variable sum is used to record the number of occurrences of each word. The function then traverses the value list and accumulates the numbers in it, which eventually yields the total number of occurrences of the word. For output, the Context variable is still used to store the information. When the Reduce stage is finished, the desired final result is obtained, as shown in Table 5.

Table 5 Word count Reduce task output

Reduce task output:
<“Connected”,1>
<“dream”,1>
<“Hadoop”,1>
<“Hello”,4>
<“Map”,1>
<“One”,2>
<“world”,3>
<“Reduce”,1>

Writing the main Function

To use the CoreMapper and CoreReducer classes for actual data processing, the main function must configure the environment in which the Hadoop MapReduce program runs through the Job class. The specific code follows.

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
    }
    Job job = new Job(conf, "WordCount");                          // set environment parameters
    job.setJarByClass(WordCount.class);                            // set the program's main class
    job.setMapperClass(CoreMapper.class);                          // add the Mapper class
    job.setReducerClass(CoreReducer.class);                        // add the Reducer class
    job.setOutputKeyClass(Text.class);                             // set the output key type
    job.setOutputValueClass(IntWritable.class);                    // set the output value type
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));     // set the input file path
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));   // set the output file path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

The code first checks whether the arguments are correct and prompts the user if they are not. It then sets the environment parameters through the Job class and registers the program's main class, Mapper class, and Reducer class. Next, it sets the program's output types, that is, the types of key and value in the <key, value> pairs output by the reduce function. Finally, it sets the input and output file paths from the arguments supplied at run time.
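
Putting the pieces together, the compile and run steps below assume that CoreMapper, CoreReducer, and main all live in one source file named WordCount.java. The following skeleton is only a sketch of that layout; the method bodies are the listings shown above, and the imports are listed in the next subsection.

public class WordCount {

    public static class CoreMapper extends Mapper<Object, Text, Text, IntWritable> {
        // map() as shown in the Map listing above
    }

    public static class CoreReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        // reduce() as shown in the Reduce listing above
    }

    public static void main(String[] args) throws Exception {
        // job configuration as shown in the main listing above
    }
}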

Core Packages

Writing a MapReduce program requires importing the following core Hadoop packages, which implement the Hadoop MapReduce framework.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

The basic functions of these core packages are described in Table 6.

Table 6 Basic functions of the core Hadoop MapReduce packages

Package                        Function
org.apache.hadoop.conf         Defines how configuration files for system parameters are handled
org.apache.hadoop.fs           Defines the abstract file system API
org.apache.hadoop.mapreduce    Implementation of the Hadoop MapReduce framework, including task distribution and scheduling
org.apache.hadoop.io           Defines general I/O APIs for reading and writing data objects over the network, in databases, and in files

Running the Code

Before running the code, set the current working directory to /usr/local/hadoop. Compiling the WordCount program requires the following three JAR files; for simplicity, add them to CLASSPATH.

$ export CLASSPATH=/usr/local/hadoop/share/hadoop/common/hadoop-common-2.7.3.jar:$CLASSPATH
$ export CLASSPATH=/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.7.3.jar:$CLASSPATH
$ export CLASSPATH=/usr/local/hadoop/share/hadoop/common/lib/commons-cli-1.2.jar:$CLASSPATH

Compile the code with the compiler that comes with the JDK.

$ javac WordCount.java

After compilation, three ".class" files appear in the directory; these are the Java executable (bytecode) files. Package them into a JAR named wordcount.jar.

$ jar -cvf wordcount.jar *.class

This produces the JAR package of the word count program. Before running the program, start the Hadoop system, including HDFS and MapReduce. Then the program can be run.

$ ./bin/hadoop jar wordcount.jar WordCount input output

Finally, run the following command to view the results.

$ ./bin/hadoop fs -cat output/*
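
Hadoop's default text output writes each word and its count on one line, separated by a tab, so based on Table 5 the command above should print something close to the following (the exact ordering depends on how the keys are sorted):

Connected	1
Hadoop	1
Hello	4
Map	1
One	2
Reduce	1
dream	1
world	3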
