1. Requirements Analysis

- Requirement: count the total number of occurrences of each word in a given text file.
- Following the MapReduce programming conventions, write a Mapper:
  (1) Convert the text content handed to us by the MapTask into a String.
  (2) Split the line into words on spaces.
  (3) Emit each word as a <k, v> pair.
- Following the MapReduce programming conventions, write a Reducer:
  (1) Sum the counts for each key.
  (2) Output the total count for that key.
- Following the MapReduce programming conventions, write a Driver:
  (1) Get the configuration information and obtain a job object instance.
  (2) Specify the local path of this program's jar.
  (3) Associate the Mapper and Reducer business classes.
  (4) Specify the <k, v> types of the Mapper output.
  (5) Specify the <k, v> types of the final output.
  (6) Specify the directory containing the job's input files.
  (7) Specify the directory for the job's output.
  (8) Submit the job.
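The map and reduce steps outlined above can be sketched in plain Java, without any Hadoop dependency, to show the intended data flow (the class name and sample lines below are hypothetical, not part of the project):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the word-count data flow described above (no Hadoop).
public class WordCountSketch {

    // "Map" step: split a line on spaces and emit (word, 1);
    // the "reduce" step (summing per key) is folded into the merge call.
    static void mapLine(String line, Map<String, Integer> counts) {
        for (String word : line.split(" ")) {
            counts.merge(word, 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        String[] lines = { "hello easysir", "hello haha", "haha easysir" };
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            mapLine(line, counts);
        }
        // Print each word and its total, in the part-r-00000 style
        counts.entrySet().stream()
              .sorted(Map.Entry.comparingByKey())
              .forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
    }
}
```

In the real job, Hadoop performs the grouping between the map and reduce phases; the `merge` call here only stands in for that shuffle-and-sum.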
2. Environment Preparation

- Create a Maven project named mrWordCount.
- Add the following dependencies to the pom.xml file:

```xml
<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>RELEASE</version>
    </dependency>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-core</artifactId>
        <version>2.8.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.2</version>
    </dependency>
</dependencies>
```
- Under the project's src/main/resources directory, create a new file named log4j.properties and fill it with the following (the Hadoop 2.7.2 dependencies above pull in log4j 1.x transitively, which is what this file configures):

```properties
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
```
3. Writing the Program

This article uses IDEA for the following steps:

- Create the package: com.easysir.wordcount
- Create the WordcountMapper class:

```java
package com.easysir.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

// The LongWritable input key is the byte offset of the input line
public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    Text k = new Text();
    IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 1. Get one line
        String line = value.toString();
        // 2. Split on spaces
        String[] words = line.split(" ");
        // 3. Emit (word, 1) pairs
        for (String word : words) {
            k.set(word);
            context.write(k, v);
        }
    }
}
```
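A side note not in the original tutorial: `split(" ")` produces empty tokens when the input contains consecutive spaces, and the Mapper above would count those empty strings as words; splitting on the regex `\s+` is a common alternative. A small sketch of the difference:

```java
import java.util.Arrays;

// Demonstrates how the Mapper's split(" ") behaves on runs of spaces.
public class SplitDemo {
    public static void main(String[] args) {
        String line = "hello  world"; // note the two spaces
        String[] bySpace = line.split(" ");
        String[] byWhitespace = line.split("\\s+");
        System.out.println(Arrays.toString(bySpace));      // [hello, , world]
        System.out.println(Arrays.toString(byWhitespace)); // [hello, world]
    }
}
```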
- Create the WordcountReducer class:

```java
package com.easysir.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    int sum;
    IntWritable v = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // 1. Sum the counts for this key
        sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        // 2. Output the total
        v.set(sum);
        context.write(key, v);
    }
}
```
- Create the WordcountDriver class:

```java
package com.easysir.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordcountDriver {

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        // 1. Get configuration information and create the job
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);

        // 2. Set the jar load path
        job.setJarByClass(WordcountDriver.class);

        // 3. Set the map and reduce classes
        job.setMapperClass(WordcountMapper.class);
        job.setReducerClass(WordcountReducer.class);

        // 4. Set the map output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 5. Set the final output kv types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 7. Submit
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
```
4. Local Testing

- Fill in the path arguments (input path and output path) in the run configuration.
  Note: the output path must not be an existing folder, otherwise the job fails with an error.
- Run the program.
- Check the result:

```
easysir	2
haha	2
heihei	1
hello	2
nihao	1
wanghu	1
```
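Since the job fails when the output folder already exists, a convenient workaround for repeated local runs (a hypothetical helper, not part of the original code) is to delete the stale output directory before resubmitting; with plain java.nio this can be sketched as:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical helper: recursively delete a stale local output directory
// before re-running the job, since an existing output path is an error.
public class CleanOutputDir {

    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(dir)) {
            // Delete children before their parent directories
            walk.sorted(Comparator.reverseOrder())
                .forEach(p -> p.toFile().delete());
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("wc-output");
        Files.write(out.resolve("part-r-00000"), "hello\t2\n".getBytes());
        deleteRecursively(out);
        System.out.println(Files.exists(out)); // prints false
    }
}
```

For cluster runs, the equivalent is removing the HDFS output directory (e.g. `hadoop fs -rm -r /output`) before resubmitting.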
5. Cluster Testing

- Add the Maven packaging plugins; note that the mainClass must match the path of WordcountDriver:

```xml
<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>2.3.2</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <archive>
                    <manifest>
                        <mainClass>com.easysir.wordcount.WordcountDriver</mainClass>
                    </manifest>
                </archive>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
```
- Package the program into a jar.
- Copy the jar to the Hadoop cluster; choose the jar without bundled dependencies, since the cluster already provides the Hadoop libraries.
- Start the Hadoop cluster.
- Run the WordCount program:

```shell
# hadoop jar <jar> <main class> <input path> <output path>
hadoop jar ./mrWordCount-1.0-SNAPSHOT.jar com.easysir.wordcount.WordcountDriver /2020 /output
```
- Check the result:

```shell
hadoop fs -cat /output/part-r-00000
```