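This chapter mainly supplements Big Data (9) and Big Data (10).
If you are studying big data, please read those earlier articles carefully first.
Big Data (9): https://blog.csdn.net/qq_34886352/article/details/82461919
Big Data (10): https://blog.csdn.net/qq_34886352/article/details/82498134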
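I. Combiner
1. The Combiner is a component of an MR program that sits alongside the Mapper and the Reducer.
2. The parent class of a Combiner is Reducer.
3. The difference between a Combiner and a Reducer is where they run:
- A Combiner runs on the node of each individual MapTask.
- A Reducer receives the output of all Mappers globally.
4. The purpose of a Combiner is to aggregate the output of each MapTask locally, which reduces the amount of data sent over the network.
5. A Combiner can only be used when it does not change the final business result, and the Combiner's output KV types must match the Reducer's input KV types.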
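To make point 5 concrete, here is a minimal sketch in plain Java (no Hadoop involved) of why summation is safe to combine locally while averaging is not. The class name, method names, and sample values are all hypothetical and only serve the illustration.

```java
import java.util.Arrays;
import java.util.List;

class CombinerSafetyDemo {
    // Sum of all values, as a word-count Reducer would compute it.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Arithmetic mean of all values.
    static double mean(List<Integer> values) {
        return (double) sum(values) / values.size();
    }

    public static void main(String[] args) {
        // Hypothetical outputs of two MapTasks for the same key.
        List<Integer> mapTask1 = Arrays.asList(3, 5, 7);
        List<Integer> mapTask2 = Arrays.asList(2, 6);
        List<Integer> all = Arrays.asList(3, 5, 7, 2, 6);

        // Summation: combining per MapTask first gives the same final result.
        int combinedSum = sum(Arrays.asList(sum(mapTask1), sum(mapTask2)));
        System.out.println("sum with combiner = " + combinedSum
                + ", sum without = " + sum(all));

        // Averaging: the mean of per-task means differs from the true mean,
        // so a Combiner that averages would change the business result.
        double meanOfMeans = (mean(mapTask1) + mean(mapTask2)) / 2;
        System.out.println("mean of means = " + meanOfMeans
                + ", true mean = " + mean(all));
    }
}
```

With these sample values both sums are 23, while the mean of means is 4.5 and the true mean is 4.6, which is why an averaging job cannot simply reuse its Reducer as a Combiner.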
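II. Steps for Implementing a Custom Combiner
Previous business logic and source code (see heading 7 of Big Data (8)):
https://blog.csdn.net/qq_34886352/article/details/82426534
Method 1:
1. Define a custom Combiner that extends Reducer and override the reduce method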
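import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordcountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum the counts produced by this MapTask for the current key
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Emit the local subtotal
        context.write(key, new IntWritable(sum));
    }
}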
2. In the Driver, set the job's Combiner to our custom Combiner
// Specify the Combiner
job.setCombinerClass(WordcountCombiner.class);
III. Auxiliary Sort and Secondary Sort Example (GroupingComparator)
1. Requirements
Given the following order data:
Order ID | Product ID | Amount
0000001 | Pdt_01 | 222.8
0000001 | Pdt_06 | 25.8
0000002 | Pdt_03 | 522.8
0000002 | Pdt_04 | 122.4
0000002 | Pdt_05 | 722.4
0000003 | Pdt_01 | 222.8
0000003 | Pdt_02 | 33.8
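We need to find the most expensive item in each order, with each order written to its own output file (using partitioned output).
Expected output: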
0000001 | 222.8
0000002 | 722.4
0000003 | 222.8
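2. Approach
- Read one line of data
- Split it into fields
- Wrap each line in a Bean object
- Group the Bean objects by order ID
- Sort the objects in each group by price from highest to lowest
- The reduce method only needs to output the first record of each key group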
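The steps above can be sketched in plain Java before bringing Hadoop into it. This is only an in-memory illustration of the sort-then-group idea: the `TopPricePerOrderDemo` and `Order` names are hypothetical, and the records are the sample rows from the requirements with the order IDs parsed as integers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TopPricePerOrderDemo {
    // Hypothetical stand-in for the Bean: one parsed input line.
    static final class Order {
        final int orderId;
        final double price;

        Order(int orderId, double price) {
            this.orderId = orderId;
            this.price = price;
        }
    }

    // Returns the most expensive record of each order.
    static List<Order> topPerOrder(List<Order> records) {
        List<Order> sorted = new ArrayList<>(records);
        // Sort phase: order ID ascending, then price descending.
        sorted.sort((a, b) -> {
            int byId = Integer.compare(a.orderId, b.orderId);
            return byId != 0 ? byId : Double.compare(b.price, a.price);
        });
        List<Order> result = new ArrayList<>();
        for (Order o : sorted) {
            // Grouping phase: a new group starts when the order ID changes,
            // and its first record is the one with the highest price.
            if (result.isEmpty() || result.get(result.size() - 1).orderId != o.orderId) {
                result.add(o);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Order> records = Arrays.asList(
                new Order(1, 222.8), new Order(1, 25.8),
                new Order(2, 522.8), new Order(2, 122.4), new Order(2, 722.4),
                new Order(3, 222.8), new Order(3, 33.8));
        for (Order o : topPerOrder(records)) {
            System.out.println(o.orderId + "\t" + o.price);
        }
    }
}
```

In the MapReduce version, the sort corresponds to the Bean's compareTo, and the "order ID changed" check corresponds to the GroupingComparator.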
3. Write the bean object
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class OrderBean implements WritableComparable<OrderBean> {
    /**
     * Order ID
     */
    private int orderId;
    /**
     * Price
     */
    private double price;

    /**
     * No-arg constructor; Hadoop needs it to create instances during deserialization
     */
    public OrderBean() {
    }

    public OrderBean(int orderId, double price) {
        this.orderId = orderId;
        this.price = price;
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeInt(orderId);
        dataOutput.writeDouble(price);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        // Fields must be read in the same order they were written
        this.orderId = dataInput.readInt();
        this.price = dataInput.readDouble();
    }

    public int getOrderId() {
        return orderId;
    }

    public void setOrderId(int orderId) {
        this.orderId = orderId;
    }

    public double getPrice() {
        return price;
    }

    public void setPrice(double price) {
        this.price = price;
    }

    @Override
    public int compareTo(OrderBean o) {
        // Primary sort: order ID ascending
        int result = Integer.compare(orderId, o.getOrderId());
        if (result == 0) {
            // Secondary sort: price descending, so the most expensive item comes first
            result = Double.compare(o.getPrice(), price);
        }
        return result;
    }

    @Override
    public String toString() {
        return orderId + "\t" + price;
    }
}
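4. Write the Mapper
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class OrderMapper extends Mapper<LongWritable, Text, OrderBean, NullWritable> {
    // Reused for every record to avoid allocating a new object per line
    OrderBean k = new OrderBean();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Read one line
        String line = value.toString();
        // Split into fields: order ID, product ID, amount
        String[] fields = line.split("\t");
        // Populate the bean
        k.setOrderId(Integer.parseInt(fields[0]));
        k.setPrice(Double.parseDouble(fields[2]));
        // Emit
        context.write(k, NullWritable.get());
    }
}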
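5. Write the Partitioner (output partitioning)
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class OrderPartitioner extends Partitioner<OrderBean, NullWritable> {
    @Override
    public int getPartition(OrderBean orderBean, NullWritable nullWritable, int numPartitions) {
        // Mask off the sign bit so the result is never negative, then spread by order ID
        return (orderBean.getOrderId() & Integer.MAX_VALUE) % numPartitions;
    }
}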
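As a quick check of the partition formula, the following plain-Java sketch evaluates it for the three sample order IDs with three partitions. The `PartitionDemo` class and `partitionFor` method are hypothetical names used only for this illustration.

```java
class PartitionDemo {
    // Same arithmetic as the Partitioner: clear the sign bit, then take the remainder.
    static int partitionFor(int orderId, int numPartitions) {
        return (orderId & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        for (int orderId = 1; orderId <= 3; orderId++) {
            System.out.println("order " + orderId + " -> partition "
                    + partitionFor(orderId, numPartitions));
        }
        // The mask keeps the index valid even for a negative ID.
        System.out.println("order -1 -> partition " + partitionFor(-1, numPartitions));
    }
}
```

Orders 1, 2, and 3 land in partitions 1, 2, and 0, so each order gets its own output file here. That only holds because the sample has three distinct order IDs and three reduce tasks; with more orders than partitions, several orders would share a file.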
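6. Write the Reducer
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class OrderReducer extends Reducer<OrderBean, NullWritable, OrderBean, NullWritable> {
    @Override
    protected void reduce(OrderBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // The key is the first record of the group, i.e. the most expensive item of the order
        context.write(key, NullWritable.get());
    }
}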
7. Write the GroupingComparator (auxiliary sort, i.e. grouping: it lets you treat records as one group according to custom logic so that they are passed to the Reducer together)
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class OrderGroupingComparator extends WritableComparator {
    /**
     * A constructor is required for grouping; passing true tells the parent class to create OrderBean instances for comparison
     */
    protected OrderGroupingComparator() {
        super(OrderBean.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        OrderBean aBean = (OrderBean) a;
        OrderBean bBean = (OrderBean) b;
        // Records with the same order ID are treated as the same key, regardless of price
        return Integer.compare(aBean.getOrderId(), bBean.getOrderId());
    }
}
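8. Write the Driver
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OrderDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Get the configuration
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // Set the jar by the driver class
        job.setJarByClass(OrderDriver.class);
        // Set the map and reduce classes
        job.setMapperClass(OrderMapper.class);
        job.setReducerClass(OrderReducer.class);
        // Set the map output key and value types
        job.setMapOutputKeyClass(OrderBean.class);
        job.setMapOutputValueClass(NullWritable.class);
        // Set the final output key and value types
        job.setOutputKeyClass(OrderBean.class);
        job.setOutputValueClass(NullWritable.class);
        // Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Set the reduce-side grouping
        job.setGroupingComparatorClass(OrderGroupingComparator.class);
        // Set the partitioner and the matching number of reduce tasks
        job.setPartitionerClass(OrderPartitioner.class);
        job.setNumReduceTasks(3);
        // Submit
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}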
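IV. Handling Small Files in Practice (Custom InputFormat)
1. Requirements
Small files hurt efficiency in both HDFS and MapReduce, yet in practice it is hard to avoid scenarios with large numbers of small files, so a solution is needed. Here, multiple small files are merged into a single SequenceFile. The SequenceFile stores many files as key/value records: the key is the file path plus file name, and the value is the file content.
2. Input data
There are three files: one.txt, two.txt, and three.txt. Each file contains multiple lines, and each line contains multiple words separated by \t.
3. Analysis
There are several ways to optimize for small files:
- At data collection time, merge small files or small batches of data into large files before uploading them to HDFS.
- Before business processing, run a MapReduce program on HDFS to merge the small files.
- During MapReduce processing, use CombineTextInputFormat to improve efficiency.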
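4. Implementation
Use a custom InputFormat to handle the small input files.
- Define a class that extends FileInputFormat
- Rewrite the RecordReader so that it reads one complete file at a time and wraps it as a KV pair
- Use SequenceFileOutputFormat on output to write the merged file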
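Before the Hadoop classes, the core idea (one whole file becomes one key/value record, keyed by its path) can be sketched with the standard library alone. This is a local-filesystem analogy, not the HDFS implementation: `WholeFileMergeDemo` and `merge` are hypothetical names, and the file contents written in `main` are made up for the demonstration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

class WholeFileMergeDemo {
    // Reads every regular file in dir as a single record:
    // key = file path, value = the complete file content.
    static Map<String, byte[]> merge(Path dir) throws IOException {
        // TreeMap keeps records ordered by key, loosely mirroring the sort by key before reduce.
        Map<String, byte[]> records = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    records.put(file.toString(), Files.readAllBytes(file));
                }
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Create three small sample files in a temporary directory.
        Path dir = Files.createTempDirectory("small-files");
        for (String name : new String[] {"one.txt", "two.txt", "three.txt"}) {
            String content = "hello\tworld\n" + name + "\n";
            Files.write(dir.resolve(name), content.getBytes(StandardCharsets.UTF_8));
        }
        for (Map.Entry<String, byte[]> record : merge(dir).entrySet()) {
            System.out.println(record.getKey() + " -> " + record.getValue().length + " bytes");
        }
    }
}
```

In the MapReduce job, the custom RecordReader plays the role of `Files.readAllBytes`, the Mapper attaches the path as the key, and SequenceFileOutputFormat persists the records.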
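5. Custom InputFormat
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WholeFileInputformat extends FileInputFormat<NullWritable, BytesWritable> {
    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        // Create the reader
        WholeRecordReader recordReader = new WholeRecordReader();
        // Initialize it
        recordReader.initialize(inputSplit, taskAttemptContext);
        return recordReader;
    }

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        // Never split a file, so that each file is read as a whole
        return false;
    }
}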
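6. Custom RecordReader
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeRecordReader extends RecordReader<NullWritable, BytesWritable> {
    BytesWritable bytesWritable = new BytesWritable();
    // Whether the single record of this split has already been read
    boolean isProcess = false;
    FileSplit split;
    Configuration configuration;

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        // Keep the split and configuration for later use
        this.split = (FileSplit) inputSplit;
        this.configuration = taskAttemptContext.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        // Each split is one whole file, so there is exactly one record to read
        if (!isProcess) {
            // Buffer sized to hold the entire file
            byte[] buf = new byte[(int) split.getLength()];
            FSDataInputStream fis = null;
            try {
                // Get the file system
                Path path = split.getPath();
                FileSystem fileSystem = path.getFileSystem(configuration);
                // Open the input stream
                fis = fileSystem.open(path);
                // Read the whole file into the buffer
                IOUtils.readFully(fis, buf, 0, buf.length);
                // Copy the buffer into the output value
                bytesWritable.set(buf, 0, buf.length);
            } finally {
                // Close only the stream; the FileSystem instance is cached and shared, so it is left open.
                // Read errors propagate to the framework instead of being swallowed.
                IOUtils.closeStream(fis);
            }
            isProcess = true;
            return true;
        }
        return false;
    }

    @Override
    public NullWritable getCurrentKey() throws IOException, InterruptedException {
        return NullWritable.get();
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return bytesWritable;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return isProcess ? 1 : 0;
    }

    @Override
    public void close() throws IOException {
    }
}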
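7. Write the Mapper
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SequenceFileMapper extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {
    Text k = new Text();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Get the split information
        FileSplit split = (FileSplit) context.getInputSplit();
        // Use the file path (including the file name) as the key
        Path path = split.getPath();
        k.set(path.toString());
    }

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context) throws IOException, InterruptedException {
        // Emit the file path with the whole file content
        context.write(k, value);
    }
}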
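8. Write the Reducer
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SequenceFileReducer extends Reducer<Text, BytesWritable, Text, BytesWritable> {
    @Override
    protected void reduce(Text key, Iterable<BytesWritable> values, Context context) throws IOException, InterruptedException {
        // Pass every file through unchanged
        for (BytesWritable value : values) {
            context.write(key, value);
        }
    }
}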
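9. Write the Driver
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SequenceFileDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Get the configuration
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // Set the jar by the driver class
        job.setJarByClass(SequenceFileDriver.class);
        // Set the map and reduce classes
        job.setMapperClass(SequenceFileMapper.class);
        job.setReducerClass(SequenceFileReducer.class);
        // Set the InputFormat and OutputFormat
        job.setInputFormatClass(WholeFileInputformat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        // Set the map output key and value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);
        // Set the final output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);
        // Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submit
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}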
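10. Run the program, remembering to pass the input location and the output location as program arguments (args[0] and args[1]).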