The first MapReduce program written by hand: WordCount

Quote:

    I have run the built-in wordcount example that ships with Hadoop before. This time we will write one by hand ourselves; it is the equivalent of "hello world" in a programming language.

First of all, let's understand which part of MapReduce we actually have to write. We know that when Hadoop processes a file, it first splits the file into many parts, processes them separately, and finally aggregates the partial results into the final output (that is, the idea of divide and conquer). Next, let's use word counting as an example and see which parts of the whole MapReduce process are the code we write.

A concrete example of the MapReduce process

First, suppose we have a file whose content is as follows:
hello world hello java
hello hadoop
Very simple: the file has just two lines. So how does Hadoop count the words? Let's describe it step by step:
Step 1: Read the file line by line, split each line into words, and produce a key/value pair for every word. After processing, the result looks like this:
<hello,1>
<world,1>
<hello,1>
<java,1>
<hello,1>
<hadoop,1>
Step 2: Sort. After sorting, the result becomes:
<hadoop,1>
<hello,1>
<hello,1>
<hello,1>
<java,1>
<world,1>
Step 3: Merge. Values belonging to the same key are grouped together; the merged result is as follows:
<hadoop,1>
<hello,1,1,1>
<java,1>
<world,1>
Step 4: Aggregate (sum) the results:
<hadoop,1>
<hello,3>
<java,1>
<world,1>

Once the fourth step is done, the word count is actually complete. After walking through this concrete example, the MapReduce processing flow should be much clearer.
Next, we need to know that the second and third steps are handled for us by the Hadoop framework; the code we actually have to write is for the first and fourth steps.
The first step corresponds to the Map phase, and the fourth step corresponds to the Reduce phase.
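
To make this data flow concrete before touching Hadoop itself, here is a minimal plain-Java sketch of my own (the class name WordCountSimulation is made up for illustration) that simulates the four steps on the two-line file above in a single process. Keep in mind that in a real job Hadoop performs the sort and merge steps for you, potentially across many machines.

import java.util.*;

public class WordCountSimulation {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello world hello java", "hello hadoop");

        // Step 1 (Map): emit a <word, 1> pair for every word
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                mapped.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // Step 2 (Sort): order the pairs by key
        mapped.sort(Map.Entry.comparingByKey());

        // Step 3 (Merge): group the values of identical keys, e.g. <hello,[1,1,1]>
        Map<String, List<Integer>> merged = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : mapped) {
            merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }

        // Step 4 (Reduce): sum the grouped values for each key
        for (Map.Entry<String, List<Integer>> e : merged.entrySet()) {
            int count = 0;
            for (int v : e.getValue()) {
                count += v;
            }
            System.out.println("<" + e.getKey() + "," + count + ">");
        }
    }
}

Running it prints <hadoop,1>, <hello,3>, <java,1>, <world,1>, matching the result of step 4 above.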

Writing the MapReduce code

Now what we have to do is write the code for the first and fourth steps.
1. Create a project


Create an ordinary Java project, click Next all the way through, and pick whatever project name you like.
2. Import the Hadoop jars. I am using the hadoop-3.2.0 version here. Which jars need to be imported?
(1) The jars under share/hadoop/common in the Hadoop directory (except the test jars; the official test examples are not needed)
(2) The jars under share/hadoop/common/lib
(3) The jars under share/hadoop/mapreduce in the Hadoop directory
(4) The jars under share/hadoop/mapreduce/lib
Then add these jars in IDEA: click File -> Project Structure -> Modules, click the small plus sign on the right, and add the jars just listed.

3. After importing the jars, create a Java class called WordCount and start writing the code.
I will paste the code directly here. __Pay attention to the import section: is it the same as mine?__ There are many classes with the same name coming from different jars, so it is easy to pick the wrong one.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.StringTokenizer;

/**
 * @author wxwwt
 * @since 2019-09-15
 */
public class WordCount {

    /**
     * Object      : type of the input key (the offset of the line in the file)
     * Text        : type of the input value (one line of data)
     * Text        : type of the output key
     * IntWritable : type of the output value
     */
    private static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                context.write(new Text(itr.nextToken()), new IntWritable(1));
            }
        }
    }

    /**
     * Text         :  key input from the Mapper (the Mapper's output key)
     * IntWritable  :  value input from the Mapper (the Mapper's output value)
     * Text         :  key output by the Reducer
     * IntWritable  :  value output by the Reducer
     */
    private static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable item : values) {
                count += item.get();
            }
            context.write(key, new IntWritable(count));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Create the configuration
        Configuration configuration = new Configuration();
        // Create the Hadoop job; the job name is WordCount
        Job job = Job.getInstance(configuration, "WordCount");
        // Set the jar via the driver class
        job.setJarByClass(WordCount.class);
        // Set the Mapper class
        job.setMapperClass(WordCountMapper.class);
        // Set the Reducer class
        job.setReducerClass(WordCountReducer.class);
        // Set the output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Set the input and output paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Exit the program after the job finishes
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

}

Mapper program:

/**
 * Object      : type of the input key (the offset of the line in the file)
 * Text        : type of the input value (one line of data)
 * Text        : type of the output key
 * IntWritable : type of the output value
 */
private static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            context.write(new Text(itr.nextToken()), new IntWritable(1));
        }
    }
}

Here, context is the job's context object. The StringTokenizer splits the value (one line of data) into tokens. If no delimiter is passed to StringTokenizer, it defaults to " \t\n\r\f" (space, tab, newline, carriage return, form feed). nextToken() then walks through the space-separated tokens, and context.write(new Text(itr.nextToken()), new IntWritable(1)) writes each key/value pair into the context.
Note: in Hadoop programming, String becomes Text and Integer becomes IntWritable. These are wrapper classes provided by Hadoop itself; just remember that they behave almost the same as the original classes. Here the output key is the word as a Text, and the output value is the count 1 as an IntWritable.
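
As a quick stand-alone illustration (the class name TokenizerDemo is my own, and it needs the same Hadoop jars on the classpath), the following sketch splits the first line of the sample file with StringTokenizer's default delimiters and wraps each token the way the Mapper does:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

import java.util.StringTokenizer;

public class TokenizerDemo {
    public static void main(String[] args) {
        // The default delimiter set is " \t\n\r\f": space, tab, newline, carriage return, form feed
        StringTokenizer itr = new StringTokenizer("hello world hello java");
        while (itr.hasMoreTokens()) {
            Text word = new Text(itr.nextToken());   // String wrapped as Hadoop Text
            IntWritable one = new IntWritable(1);    // int wrapped as Hadoop IntWritable
            System.out.println("<" + word + "," + one.get() + ">");
        }
    }
}

It prints <hello,1>, <world,1>, <hello,1>, <java,1>, one pair per line, matching step 1 of the example.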

Reducer program:

/**
 * Text         :  key input from the Mapper (the Mapper's output key)
 * IntWritable  :  value input from the Mapper (the Mapper's output value)
 * Text         :  key output by the Reducer
 * IntWritable  :  value output by the Reducer
 */
private static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable item : values) {
            count += item.get();
        }
        context.write(key, new IntWritable(count));
    }
}

The Reducer finishes the fourth step. From the example walkthrough above, we know that its input at this point looks roughly like
<hello,1,1,1>,
so it iterates over the values and adds the three 1s together.
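
If you want to see the summation in isolation, here is a tiny stand-alone sketch of my own (the class name ReduceSumDemo is made up, and it is not run inside the framework) that mimics the grouped input <hello,1,1,1>:

import org.apache.hadoop.io.IntWritable;

import java.util.Arrays;
import java.util.List;

public class ReduceSumDemo {
    public static void main(String[] args) {
        // Simulate the grouped values for the key "hello": [1, 1, 1]
        List<IntWritable> values = Arrays.asList(
                new IntWritable(1), new IntWritable(1), new IntWritable(1));
        int count = 0;
        for (IntWritable item : values) {
            count += item.get();   // same summation loop as in the Reducer
        }
        System.out.println("<hello," + count + ">");   // prints <hello,3>
    }
}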

Program entry:

public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
    // Create the configuration
    Configuration configuration = new Configuration();
    // Create the Hadoop job; the job name is WordCount
    Job job = Job.getInstance(configuration, "WordCount");
    // Set the jar via the driver class
    job.setJarByClass(WordCount.class);
    // Set the Mapper class
    job.setMapperClass(WordCountMapper.class);
    // Set the Reducer class
    job.setReducerClass(WordCountReducer.class);
    // Set the output key and value types
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Set the input and output paths
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Exit the program after the job finishes
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

The program entry point is easiest to follow by reading the comments: it just sets the parameters and paths that MapReduce needs, so you can write it by following the pattern. One thing deserves attention:

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

Recall that when we ran Hadoop's first example, the command looked roughly like this:
hadoop jar WordCount.jar /input/wordcount/file1 /output/wcoutput
The two trailing parameters are the input path and the output path. If our code changes the position of these parameters or works with additional arguments, the args indexes must be adjusted to match.
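
Since the input and output paths come straight from args[0] and args[1], a small defensive check at the top of main (my own addition, not part of the original example) can turn a confusing ArrayIndexOutOfBoundsException into a clear usage message:

// Optional check at the start of main(), before Job.getInstance is called
if (args.length < 2) {
    System.err.println("Usage: hadoop jar WordCount.jar <input path> <output path>");
    System.exit(2);
}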

4. Specify the entry point for running the jar.
Once the code is finished, we can package it.
First select File -> Project Structure -> Artifacts -> + -> JAR -> From modules with dependencies,

then select the main class of the WordCount we just wrote,

and click Build -> Build Artifacts.

A box will then pop up; choose Build.

An out directory will then be generated in the project; find the WordCount.jar we need inside it and upload it to the server where Hadoop is located.
That is basically the end, because the remaining steps are the same as in my previous article; you can refer to: hadoop run the first instance wordcount.

Precautions:

Running it directly with

  hadoop jar WordCount.jar /input/wordcount/file1  /output/wcoutput

may fail with an exception:

Exception in thread "main" java.io.IOException: Mkdirs failed to create /XXX/XXX
  at org.apache.hadoop.util.RunJar.ensureDirectory(RunJar.java:106)
  at org.apache.hadoop.util.RunJar.main(RunJar.java:150)

or something similar to the above.

In this case, you need to delete the LICENSE entry from inside the jar package. You can refer to this link: stackoverflow
List the license files and folders inside the jar:
  jar tvf XXX.jar | grep -i license
Then delete the META-INF/LICENSE entry:
  zip -d XXX.jar META-INF/LICENSE

To sum up:

1. Understand the steps of a MapReduce job, so that we know we only need to write the map and reduce phases; the intermediate steps are handled by the Hadoop framework. Other programs can be written by following the same pattern in the future.
2. In Hadoop, String becomes Text and Integer becomes IntWritable. Remember this; using the wrong type will throw an exception.
3. When you get "Mkdirs failed to create /XXX/XXX", first check whether the path is wrong. If not, delete META-INF/LICENSE from the jar package.

Reference materials:

1. https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
2. https://stackoverflow.com/questions/10522835/hadoop-java-io-ioexception-mkdirs-failed-to-create-some-path
