[Intelligent Big Data Analysis] Experiment 1 MapReduce Experiment: Word Counting

In one of my previous blogs, Big Data Processing in Cloud Computing: Trying the Application of HDFS and MapReduce, I walked through a similar operation. If you are unsure about any of the steps here, you can refer to that post.

1. Experimental purpose

Based on the MapReduce idea, write the WordCount program.

2. Experimental requirements

1. Understand MapReduce programming ideas;

2. Be able to write the MapReduce version WordCount;

3. Be able to execute the program;

4. Analyze the execution process by yourself.

3. Experimental principles

MapReduce is a computing model. Put simply, a large batch of work (data) is decomposed (Map) and executed in pieces, and the partial results are then merged into the final result (Reduce). The advantage is that once a task is decomposed, it can be computed in parallel on a large number of machines, reducing the overall running time.

Scope of application: the amount of data is large, but the number of distinct data types (keys) is small enough to fit in memory.

Basic principle and key points: hand the data to different machines for processing, splitting the data and then reducing (merging) the results.

Understanding MapReduce and Yarn: in newer versions of Hadoop, Yarn serves as the resource management and scheduling framework and provides the runtime environment for MapReduce programs under Hadoop. In fact, besides running on the Yarn framework, MapReduce can also run on scheduling frameworks such as Mesos and Corona; using a different scheduling framework requires a different adaptation of Hadoop.

The execution process of a complete MapReduce program on Yarn is as follows:

(1) JobClient submits a job to ResourceManager.

(2) ResourceManager requests a container from the Scheduler in which to run MRAppMaster, and then starts it.

(3) MRAppMaster registers with ResourceManager after starting up.

(4) JobClient obtains MRAppMaster-related information from ResourceManager and then communicates directly with MRAppMaster.

(5) MRAppMaster calculates splits and constructs resource requests for all maps.

(6) MRAppMaster does some necessary preparation work for MR OutputCommitter.

(7) MRAppMaster initiates a resource request to RM (Scheduler), obtains a set of containers for map/reduce tasks to run, and then works with NodeManager to perform some necessary tasks for each container, including resource localization, etc.

(8) MRAppMaster monitors the running task until it is completed. When the task fails, it applies for a new container to run the failed task.

(9) After each map/reduce task is completed, MRAppMaster runs the cleanup code of the MR OutputCommitter to perform the finishing work.

(10) When all map/reduce is completed, MRAppMaster runs the necessary job commit or abort APIs of OutputCommitter.

(11) MRAppMaster exits.

1 MapReduce programming

When writing MapReduce programs that rely on the Yarn framework for execution in Hadoop, you do not need to develop MRAppMaster and YARNRunner yourself, because Hadoop already provides general-purpose YARNRunner and MRAppMaster programs by default. In most cases, you only need to write the corresponding Map and Reduce processing logic.

Writing a MapReduce program is not complicated. The key point is to master distributed programming ideas and methods. The calculation process is mainly divided into the following five steps:

(1) Iteration. Traverse the input data and parse it into key/value pairs.

(2) Map the input key/value pairs into other key/value pairs.

(3) Group the intermediate data according to the key.

(4) Reduce the data in groups.

(5) Iteration. Save the final key/value pair to the output file.

2 Java API analysis

(1) InputFormat: describes the format of the input data; TextInputFormat is the commonly used implementation. It provides the following two functions:

Data splitting: splits the input data into several splits according to a given strategy, which determines the number of Map Tasks and the split each of them processes.

Provide data to Mapper: given a split, it can be parsed into key/value pairs.

(2) OutputFormat: Used to describe the format of output data. It can write the key/value pairs provided by the user into a file in a specific format.

(3) Mapper/Reducer: Mapper/Reducer encapsulates the data processing logic of the application.

(4) Writable: Hadoop's custom serialization interface. A class that implements this interface can be used as value data in the MapReduce process.

(5) WritableComparable: builds on Writable and additionally inherits the Comparable interface. A class that implements this interface can be used as key data in the MapReduce process (because keys are involved in comparison and sorting operations).

4. Experimental steps

This experiment is mainly divided into the following steps: confirming the preliminary preparation, writing the MapReduce program, packaging and submitting the code, and viewing the running results.

1 Start Hadoop

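The exact commands depend on how Hadoop is installed; a minimal sketch, assuming the Hadoop sbin scripts are on the PATH, looks like this:

start-dfs.sh     # start the HDFS daemons (NameNode, SecondaryNameNode, DataNodes)
start-yarn.sh    # start the Yarn daemons (ResourceManager, NodeManagers)
jps              # check that the daemon processes are running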

2 Verify that there is no wordcount folder on HDFS


At this time, there should be no wordcount folder on HDFS.
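A quick way to check is to list the HDFS root directory; /wordcount is the output path used when the job is submitted later, and the job will fail if that directory already exists:

hdfs dfs -ls /
hdfs dfs -rm -r /wordcount    # only needed if a previous run left the folder behind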

3 Upload data files to HDFS

The data file to be counted is wordcount.txt; upload it to HDFS.
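Assuming wordcount.txt is in the current local directory, and using the input path /user/wordcount.txt that the submit command below expects, the upload looks roughly like this:

hdfs dfs -put wordcount.txt /user/wordcount.txt
hdfs dfs -cat /user/wordcount.txt    # confirm the contents were uploaded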

4 Write the MapReduce program

Mainly write the Map and Reduce classes. The Map process needs to inherit the Mapper class in the org.apache.hadoop.mapreduce package and override its map method; the Reduce process needs to inherit the Reducer class in the org.apache.hadoop.mapreduce package and override its reduce method.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

import java.io.IOException;
import java.util.StringTokenizer;


public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // map method: tokenize one line of text and emit a <word, 1> pair for each word
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);    // emit <word, 1>
            }
        }
    }

    // Reduce class: for the same word, add up all the values in its value list
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();    // e.g. for <Hello,1>, <Hello,1>, add the two 1s together
            }
            result.set(sum);
            context.write(key, result);    // emit <word, number of occurrences>
        }
    }

    // main method: program entry point
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();               // instantiate the configuration class
        Job job = new Job(conf, "WordCount");                   // instantiate the Job class
        job.setInputFormatClass(TextInputFormat.class);         // use the default input format class
        TextInputFormat.setInputPaths(job, args[0]);            // set the location of the input file
        job.setJarByClass(WordCount.class);                     // set the main class
        job.setMapperClass(TokenizerMapper.class);              // use the custom Map class defined above
        job.setCombinerClass(IntSumReducer.class);              // enable the Combiner
        job.setMapOutputKeyClass(Text.class);                   // key type of the Map output
        job.setMapOutputValueClass(IntWritable.class);          // value type of the Map output
        job.setPartitionerClass(HashPartitioner.class);         // use the default HashPartitioner class
        job.setReducerClass(IntSumReducer.class);               // use the custom Reduce class defined above
        job.setNumReduceTasks(Integer.parseInt(args[2]));       // set the number of Reduce tasks
        job.setOutputKeyClass(Text.class);                      // key type of the Reduce output
        job.setOutputValueClass(Text.class);                    // value type of the Reduce output
        job.setOutputFormatClass(TextOutputFormat.class);       // use the default output format class
        TextOutputFormat.setOutputPath(job, new Path(args[1])); // set the output location
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit the job and monitor its status
    }
}


5 Use commands to package the code

The above code reports an error when compiled and run.

The main problem is that in Hadoop 3.x the Job constructors are deprecated, and Job.getInstance must be used instead. There is also a type mismatch: the Reduce class emits IntWritable values, but job.setOutputValueClass is set to Text.class, and the two need to match.

Here is the modified code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.StringTokenizer;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "WordCount");
        job.setJarByClass(WordCount.class);

        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The following is the packaging process:

  • Create a folder named src in the root directory of the Java project we created.

  • Move all Java source code files (.java) into the src folder.

  • Create a file named Manifest.txt in the project root directory that specifies the entry point for the JAR file.

  • In the Manifest.txt file, add the following content:

    Main-Class: <Main-Class>
    

    Replace <Main-Class> with the fully qualified name of the class that contains the main method; for this project it is WordCount.

  • Back in the project root directory, create a temporary directory for the compiled class files and compile the Java source code with the following commands:

    mkdir classes
    javac -d classes src/*.java
    

    If the compilation command fails with a "程序包×××不存在" (package ××× does not exist) error, add the Hadoop-related jar files to the classpath to resolve it:

    javac -classpath /usr/local/servers/hadoop/share/hadoop/common/hadoop-common-3.1.3.jar:/usr/local/servers/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.1.3.jar -d classes src/*.java
    

    Note that this is a single command written on one line, not several.

  • Create a JAR file named WordCount.jar:

    jar -cvf WordCount.jar -C classes/ .
    
  • Add the compiled class files and Manifest.txt to the JAR file:

    jar -uf WordCount.jar -C classes/ .
    
    jar -uf WordCount.jar Manifest.txt
    

At this point, the entire Java project has been packaged successfully.
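As an optional sanity check (not part of the original steps), you can list the contents of the jar to confirm that the class files and the manifest were included:

jar -tf WordCount.jar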

6 Submit the jar file to run the MapReduce job on the Hadoop cluster

We submit the packaged WordCount.jar to the cluster with the following command:

hadoop jar WordCount.jar WordCount /user/wordcount.txt /wordcount

After successful execution, the terminal prints the job's progress and counter information.


Then we view our output directory:

hdfs dfs -ls /wordcount


The part-r-00000 file in the output directory contains the result we need; download it to view it:

hdfs dfs -get /wordcount/part-r-00000 /root/WordCount
vim part-r-00000
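Alternatively, the result can also be viewed directly on HDFS without downloading it (an equivalent check):

hdfs dfs -cat /wordcount/part-r-00000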

We can see that the expected results were obtained, and with that this experiment is complete.
