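I. Introduction to MapReduce
MapReduce is one of Hadoop's three core components (classified by function) and provides distributed computation for the big data platform. It is relatively heavyweight and suited only to offline batch processing, but understanding it helps a great deal in understanding the principles and architecture of frameworks such as Spark.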
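II. Writing the WordCount example
For ease of testing, this example is run locally on Windows 10.
1. Preparation
1) Data preparation
Extract wordCountdemo.rar into a folder; in this example it is extracted to D:\mktest
2) Jar preparation (Maven configuration)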
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.7.6</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.7.6</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>2.7.6</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-common</artifactId>
<version>2.7.6</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-yarn</artifactId>
<version>2.7.6</version>
</dependency>
</dependencies>
That is: hadoop-common, hadoop-hdfs, hadoop-mapreduce-client-core, hadoop-mapreduce-client-common and hadoop-yarn.
(With Maven, the jars these artifacts depend on are downloaded automatically.)
3) Add the log4j.properties logging configuration file (under src)
###set log levels###
log4j.rootLogger=info, stdout
###output to the console###
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=[%d{dd/MM/yy HH:mm:ss:SSS z}] %t %5p %c{2}: %m%n
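The %d{...} part of the ConversionPattern uses java.text.SimpleDateFormat syntax, so dd/MM/yy prints the day first and a two-digit year last. The following is a minimal sketch of what a timestamp in this layout looks like. The sample instant, the Asia/Shanghai time zone and the class name LogTimestampDemo are assumptions chosen for illustration, not part of the project.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class LogTimestampDemo {
    // Same date pattern as inside %d{...} in the ConversionPattern.
    static final String PATTERN = "dd/MM/yy HH:mm:ss:SSS z";

    // Formats a date the way the log layout would, including the surrounding brackets.
    static String format(Date date, TimeZone zone) {
        SimpleDateFormat fmt = new SimpleDateFormat(PATTERN, Locale.ENGLISH);
        fmt.setTimeZone(zone);
        return "[" + fmt.format(date) + "]";
    }

    // Formats an assumed sample instant: 13 March 2019, 20:14:30.415 in Asia/Shanghai.
    static String formatSample() {
        TimeZone zone = TimeZone.getTimeZone("Asia/Shanghai");
        Calendar cal = Calendar.getInstance(zone, Locale.ENGLISH);
        cal.clear();
        cal.set(2019, Calendar.MARCH, 13, 20, 14, 30);
        cal.set(Calendar.MILLISECOND, 415);
        return format(cal.getTime(), zone);
    }

    public static void main(String[] args) {
        // The line starts with [13/03/19 20:14:30:415, followed by the zone abbreviation.
        System.out.println(formatSample());
    }
}
```

The text printed for z depends on the time zone and the JDK's locale data, so only the date and time fields should be relied on when reading the log.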
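2. WordCount implementation
Structure: a custom Mapper, a custom Reducer, and a Driver
1) Custom Mapper class
Create the WordCountMapper class, which extends Mapper.
The words in all five input files are separated by tabs, so the mapper splits each line on the tab character.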
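package com.mycat.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Reused object that holds the map output key
    private final Text sendKey = new Text();
    // Reused object that holds the map output value
    private final IntWritable sendValue = new IntWritable();

    /**
     * The framework calls map() once per input line, so the logic only has to handle one line.
     * Parameter 1 (key): byte offset of the start of the line, determined by how many bytes the preceding lines occupy.
     * Parameter 2 (value): the content of the line.
     * Parameter 3 (context): context object; it connects to the framework upstream and is the
     * interface for emitting output to the next stage.
     * LongWritable and Text are serializable Hadoop types (corresponding to Java's long and String).
     * Serializable types are needed because distributed computation moves data across the network.
     * Why not Java's built-in Serializable interface? It is broadly compatible, but its serialization
     * and deserialization performance is poor, so Hadoop uses its own Writable interface instead.
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Split the words of the current line on the tab character
        String[] words = value.toString().split("\t");
        // Tag every word with the value 1 and pass the pair to the next stage through the context
        for (String word : words) {
            sendKey.set(word);
            sendValue.set(1);
            context.write(sendKey, sendValue);
        }
    }
}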
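The comment in the Mapper contrasts Java's built-in serialization with Hadoop's Writable types. The difference can be illustrated with the plain JDK: LongWritable writes only the eight bytes of its long value to a DataOutput, whereas ObjectOutputStream also writes a stream header and class metadata. This is a rough sketch for comparison only; SerializationSizeDemo is a hypothetical class name and the code does not use Hadoop itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

public class SerializationSizeDemo {
    // Number of bytes Java's built-in serialization produces for a boxed long.
    static int javaSerializedSize(long value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(Long.valueOf(value));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    // Number of bytes produced by writing the raw long to a DataOutput,
    // which is the same call a LongWritable makes when it serializes itself.
    static int rawSize(long value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeLong(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    public static void main(String[] args) {
        System.out.println("java.io.Serializable: " + javaSerializedSize(42L) + " bytes");
        System.out.println("raw writeLong:        " + rawSize(42L) + " bytes");
    }
}
```

With millions of key/value pairs crossing the network during the shuffle, the per-record overhead of the built-in mechanism adds up quickly, which is the motivation given in the comment.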
2) Custom Reducer class
Create the WordCountReducer class, which extends Reducer.
Note: the Text type to import is org.apache.hadoop.io.Text
package com.mycat.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

/**
 * reduce() is called once per group, that is, once per distinct key.
 * Parameter 1 (key): its type must match the Mapper's output key type; its value is a key emitted by the mapper.
 * Parameter 2 (values): iterator over the mapper's output values after they have been sorted and grouped by key.
 * Parameter 3 (context): context object; it links the stages and hands the result to the output stage.
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    /**
     * @param key     for example: hello
     * @param values  for example: 1,1,1,1,1
     * @param context context object
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Accumulates the total for the current group
        int sum = 0;
        // Iterate over the values and add them up
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
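To see how the Mapper and Reducer fit together without starting a Hadoop job, the same flow can be imitated in plain Java. This is only a simplified single-process sketch: a TreeMap stands in for the framework's sort-and-group step, the two input lines are made up, and LocalWordCount is a hypothetical class name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalWordCount {
    // "Map" + "shuffle": split every line on tabs, emit a 1 per word,
    // and let the TreeMap group the 1s by word while keeping the keys sorted.
    static Map<String, List<Integer>> mapAndGroup(List<String> lines) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split("\t")) {
                groups.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        return groups;
    }

    // "Reduce": called logically once per key, sums the values of that group.
    static Map<String, Integer> reduce(Map<String, List<Integer>> groups) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            int sum = 0;
            for (int one : group.getValue()) {
                sum += one;
            }
            counts.put(group.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample input with tab-separated words
        List<String> lines = Arrays.asList("hello\tworld", "hello\thdfs");
        Map<String, List<Integer>> groups = mapAndGroup(lines);
        System.out.println(groups);         // {hdfs=[1], hello=[1, 1], world=[1]}
        System.out.println(reduce(groups)); // {hdfs=1, hello=2, world=1}
    }
}
```

The intermediate map corresponds to what reduce() receives in the real job: one key such as hello together with an iterable of 1s.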
3) Create the Driver class
package com.mycat.mapreduce.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

public class Driver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Create the job object from a Configuration
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // Identify the job's jar by the class it contains (the entry class once packaged)
        job.setJarByClass(Driver.class);
        // Specify the Mapper and Reducer classes
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        // Specify the output key/value types of the custom mapper
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Specify the output key/value types of the custom reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input directory containing the extracted test files
        FileInputFormat.addInputPath(job, new Path("D:\\mktest\\wordCountdemo"));
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("D:\\mktest\\wordcount");
        // The output directory must not exist, otherwise the job fails;
        // for convenient testing, delete it recursively if it is already there
        if (fs.exists(path)) {
            fs.delete(path, true);
        }
        FileOutputFormat.setOutputPath(job, path);
        // Submit the job, wait for it to finish, and report success through the exit code
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
3. Results
1) Console output
[13/03/19 20:14:30:415 CST] main INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local965530918_0001
.................................................
.................................................
[13/03/19 20:14:31:562 CST] main INFO mapreduce.Job: map 100% reduce 100%
[13/03/19 20:14:31:563 CST] main INFO mapreduce.Job: Job job_local965530918_0001 completed successfully
[13/03/19 20:14:31:573 CST] main INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=14818
FILE: Number of bytes written=1758716
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=33
Map output records=77
Map output bytes=723
Map output materialized bytes=907
Input split bytes=500
Combine input records=0
Combine output records=0
Reduce input groups=13
Reduce shuffle bytes=907
Reduce input records=77
Reduce output records=13
Spilled Records=154
Shuffled Maps =5
Failed Shuffles=0
Merged Map outputs=5
GC time elapsed (ms)=0
Total committed heap usage (bytes)=3019898880
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=438
File Output Format Counters
Bytes Written=117
2) Output folder
# Directory of D:\mktest\wordcount
2019/03/13 20:14 12 .part-r-00000.crc
2019/03/13 20:14 8 ._SUCCESS.crc
2019/03/13 20:14 105 part-r-00000
2019/03/13 20:14 0 _SUCCESS
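3) Description of the output files
.part-r-00000.crc: checksum file for the result file
._SUCCESS.crc: checksum file for the success marker file
part-r-00000: the result file (there is only one because, by default, there is only one reduce task)
_SUCCESS: success marker file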
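Why one reduce task means one result file can be illustrated with the arithmetic of Hadoop's default hash partitioning: the key's non-negative hash code modulo the number of reduce tasks selects the partition, and each partition is written to its own part-r-NNNNN file. The sketch below uses String.hashCode() as a stand-in for Text.hashCode(), so the partitions it prints for more than one reducer are an assumption and will not match a real job; PartitionDemo is a hypothetical class name.

```java
public class PartitionDemo {
    // Same arithmetic as hash partitioning: non-negative hash modulo the reducer count.
    // String.hashCode() is a stand-in here; Hadoop's Text computes its hash differently.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Reducer output files are named part-r- followed by the five-digit partition number.
    static String outputFileName(int partition) {
        return String.format("part-r-%05d", partition);
    }

    public static void main(String[] args) {
        String[] words = {"hello", "world", "hdfs"};
        for (int reducers : new int[] {1, 3}) {
            for (String word : words) {
                int partition = partitionFor(word, reducers);
                System.out.println(reducers + " reducer(s): " + word + " -> " + outputFileName(partition));
            }
        }
    }
}
```

With a single reducer every key maps to partition 0, which is why only part-r-00000 appears in the listing above.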
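4) Viewing the result file (the malformed value on the first line comes from the mock test data, which was accidentally saved incorrectly; the test works all the same)
Output format: each line holds a word followed by its number of occurrences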
00:0c:29:16:90 1
fer 4
fhieu 4
fjeir 4
fjir 4
fre 8
hdf 8
hdfs 4
hds 4
hello 16
hfureh 4
word 4
world 12
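The result file can be cross-checked against the job counters printed on the console: the number of lines should equal "Reduce input groups" and "Reduce output records" (13), and the sum of all counts should equal "Map output records" (77). A small sketch of that check follows. The lines are copied from the listing above, the tab separator between word and count is an assumption (the parser accepts any whitespace), and CounterCheck is a hypothetical class name.

```java
import java.util.Arrays;
import java.util.List;

public class CounterCheck {
    // Result lines copied from part-r-00000; the tab separator is assumed.
    static final List<String> RESULT = Arrays.asList(
            "00:0c:29:16:90\t1", "fer\t4", "fhieu\t4", "fjeir\t4", "fjir\t4",
            "fre\t8", "hdf\t8", "hdfs\t4", "hds\t4", "hello\t16",
            "hfureh\t4", "word\t4", "world\t12");

    // Sums the last whitespace-separated field (the count) of every line.
    static int totalCount(List<String> resultLines) {
        int total = 0;
        for (String line : resultLines) {
            String[] fields = line.trim().split("\\s+");
            total += Integer.parseInt(fields[fields.length - 1]);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("groups  = " + RESULT.size());        // groups  = 13
        System.out.println("records = " + totalCount(RESULT));   // records = 77
    }
}
```

Both numbers agree with the counters, so every (word, 1) pair emitted by the mappers is accounted for in the output.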