Hadoop MapReduce Introductory Example

1. Preparation

  1. Download the latest release of Hadoop (3.1.2 at the time of writing) from the official Hadoop website
  2. Configure the Hadoop-related environment variables:
export HADOOP_HOME=/work/dev_tools/hadoop-3.1.2
export PATH=$HADOOP_HOME/bin:$PATH:.
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
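
A quick sanity check (a sketch; it assumes the exports above are active in the current shell) that the variables are wired up:

```shell
# Confirm HADOOP_HOME expands and that its bin directory was prepended
# to PATH, so the `hadoop` launcher resolves first.
echo "$HADOOP_HOME"
case ":$PATH:" in
  *":$HADOOP_HOME/bin:"*) echo "hadoop bin on PATH" ;;
  *)                      echo "hadoop bin missing from PATH" ;;
esac
```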

2. MapReduce code example

Function: given a text, count the frequency of every word

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.StringTokenizer;

/**
 * @author lvsheng
 * @date 2019-09-01
 **/
public class WordCount {

	/** Map phase: tokenize each input line and emit (word, 1) for every token. */
	public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

		private final static IntWritable one  = new IntWritable(1);
		private              Text        word = new Text();

		@Override
		public void map(Object key, Text value, Context context)
				throws IOException, InterruptedException {
			StringTokenizer itr = new StringTokenizer(value.toString());
			while (itr.hasMoreTokens()) {
				word.set(itr.nextToken());
				context.write(word, one);
			}
		}
	}

	/** Reduce phase: sum the counts emitted for each word. */
	public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

		private IntWritable result = new IntWritable();

		@Override
		public void reduce(Text key, Iterable<IntWritable> values, Context context)
				throws IOException, InterruptedException {
			int sum = 0;
			for (IntWritable val : values) {
				sum += val.get();
			}
			result.set(sum);
			context.write(key, result);
		}
	}

	public static void main(String[] args) throws Exception {
		long          start = System.currentTimeMillis();
		Configuration conf  = new Configuration();
		Job job = Job.getInstance(conf, "word count");
		job.setJarByClass(WordCount.class);
		job.setMapperClass(TokenizerMapper.class);
		// The reducer doubles as a combiner: partial sums are computed
		// on the map side to reduce shuffle traffic.
		job.setCombinerClass(IntSumReducer.class);
		job.setReducerClass(IntSumReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		job.waitForCompletion(true);
		System.out.println("time cost : " + (System.currentTimeMillis() - start) / 1000 + " s");
	}
}

Note: my class sits in the default package, i.e. it has no package declaration. With a package path, an error is reported at execution time; this is described in detail later.
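
To make the map/reduce dataflow concrete, here is a minimal local sketch (plain Java, no Hadoop; the class name LocalWordCount is illustrative) of the same tokenize-then-sum logic the job performs:

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LocalWordCount {

    // The "map" step emits (word, 1) pairs; the "reduce" step sums per key.
    // Here both collapse into a single in-memory pass over the text.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> sums = new TreeMap<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            sums.merge(itr.nextToken(), 1, Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Prints counts in sorted key order, like the job's output.
        System.out.println(count("the quick brown fox the lazy dog the"));
    }
}
```

In the real job the same work is split across machines: mappers emit the pairs, the framework groups them by key, and reducers perform the summation.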

3. Run the job

  1. Compile the class:
hadoop com.sun.tools.javac.Main WordCount.java
  2. Package the compiled bytecode files into a jar:
jar cf WordCount.jar WordCount*.class
  3. Run the program:
hadoop jar WordCount.jar WordCount /Users/lvsheng/Movies/aclImdb/train/pos /temp/output2
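
As a quick way to sanity-check the job's output on a small sample, the same counting can be approximated with plain POSIX tools (sample.txt below is illustrative, not from the original post):

```shell
# Local cross-check: split on whitespace, sort, and count duplicates.
# The output resembles the job's per-word counts, highest first.
printf 'the quick brown fox the lazy dog the\n' > sample.txt
tr -s '[:space:]' '\n' < sample.txt | sort | uniq -c | sort -rn
```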

The input data set I used was fairly large, so the job took over an hour to finish on a single machine.

4. A small problem encountered

When my job class has a package declaration, running the program consistently fails with a class-not-found error, regardless of whether the class path is added to the command.

Execution command with path:

hadoop jar WordCount.jar com.alibaba.ruzun.WordCount /Users/lvsheng/Movies/aclImdb/train/pos /temp/output2

Error stack:

Exception in thread "main" java.lang.ClassNotFoundException: com.alibaba.ruzun.WordCount
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:311)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:232)

Removing the package path from the command does not help either. Since the package path causes the failure, I simply moved the job class into a Java file with no package declaration, and the problem went away. A likely cause: `jar cf WordCount.jar WordCount*.class` stores the class files at the root of the jar, while a class declared in package com.alibaba.ruzun must be stored under com/alibaba/ruzun/ inside the jar for the class loader to find it. I will look into this further when studying the topic in more depth.


Origin blog.csdn.net/bruce128/article/details/100531206