Big Data (11): The Shuffle Mechanism (Combiner, Secondary Sort/Grouping with GroupingComparator) and Handling Small Files in Practice (Custom InputFormat)

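This chapter is mainly a supplement to Big Data (9) and Big Data (10).

If you are learning big data, please read those earlier articles carefully first.

Big Data (9): https://blog.csdn.net/qq_34886352/article/details/82461919

Big Data (10): https://blog.csdn.net/qq_34886352/article/details/82498134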

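I. Combiner

1. A Combiner is a component of an MR program that sits alongside the Mapper and the Reducer.

2. The parent class of a Combiner is Reducer.

3. A Combiner and a Reducer differ in where they run:

A Combiner runs on the node of each individual MapTask.
A Reducer receives the output of all Mappers globally.

4. The purpose of a Combiner is to locally aggregate the output of each MapTask, which reduces the amount of data sent over the network.

5. A Combiner can only be used when it does not change the final business result, and the Combiner's output KV types must match the Reducer's input KV types.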
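To make point 5 concrete, here is a small plain-Java sketch (not Hadoop code). The class name CombinerSafetyDemo and the sample numbers are made up for illustration. It imitates two MapTasks whose outputs are first combined locally and then reduced globally: for a sum the result is unchanged, while for an average the local step changes the answer, so an averaging Reducer could not simply be reused as a Combiner.

```java
import java.util.Arrays;
import java.util.List;

class CombinerSafetyDemo {
    // Sum of all values: the kind of reduction a word-count Reducer performs.
    static double sum(List<Double> values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    // Arithmetic mean of all values.
    static double average(List<Double> values) {
        return sum(values) / values.size();
    }

    public static void main(String[] args) {
        // Hypothetical outputs of two MapTasks for the same key.
        List<Double> task1 = Arrays.asList(1.0, 2.0, 3.0);
        List<Double> task2 = Arrays.asList(10.0);
        List<Double> all = Arrays.asList(1.0, 2.0, 3.0, 10.0);

        // Sum: combining per task first gives the same total as reducing everything at once.
        double combinedSum = sum(Arrays.asList(sum(task1), sum(task2)));
        System.out.println(combinedSum == sum(all)); // true

        // Average: the average of per-task averages differs from the true average.
        double combinedAvg = average(Arrays.asList(average(task1), average(task2)));
        System.out.println(combinedAvg);  // 6.0
        System.out.println(average(all)); // 4.0
    }
}
```

Sums (and counts, maxima, minima) survive being applied in two stages, which is why the word-count example can safely use a Combiner.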

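II. Steps to implement a custom Combiner

The earlier business logic and source code are in Section VII of Big Data (8):

https://blog.csdn.net/qq_34886352/article/details/82426534
Method 1:

1. Define a custom Combiner that extends Reducer and override the reduce method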

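import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordcountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum the counts that this MapTask produced for the key
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Emit the local total
        context.write(key,new IntWritable(sum));
    }
}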

2. In the Driver, set the job's Combiner to our custom Combiner

// Set the Combiner
job.setCombinerClass(WordcountCombiner.class);

III. Secondary sort and grouping example (GroupingComparator)

1. Requirement

Given the following order data:

Order id	Product id	Amount
0000001	Pdt_01	222.8
0000001	Pdt_06	25.8
0000002	Pdt_03	522.8
0000002	Pdt_04	122.4
0000002	Pdt_05	722.4
0000003	Pdt_01	222.8
0000003	Pdt_02	33.8

We need to find the most expensive item in each order (one output file per order, produced by partitioning).

Expected output:

0000001	222.8
0000002	722.4
0000003	222.8

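2. Approach

Read one line of data
Split it into fields
Wrap each line in a Bean object
Group the Bean objects by order
Sort the objects within each group by price from high to low
The reduce method then only needs to output the first record of each group of keys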
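The steps above can be imitated in plain Java without Hadoop, which may help in seeing why "sort, group, take the first record" gives the most expensive item. This is only a sketch: the class TopPricePerOrderDemo and its nested Order class are hypothetical stand-ins, and the data is the sample from the requirement.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TopPricePerOrderDemo {
    // Minimal stand-in for the bean: order id plus price.
    static final class Order {
        final int orderId;
        final double price;

        Order(int orderId, double price) {
            this.orderId = orderId;
            this.price = price;
        }
    }

    // The sample records from the requirement (product ids omitted).
    static List<Order> sampleOrders() {
        List<Order> orders = new ArrayList<>();
        orders.add(new Order(1, 222.8));
        orders.add(new Order(1, 25.8));
        orders.add(new Order(2, 522.8));
        orders.add(new Order(2, 122.4));
        orders.add(new Order(2, 722.4));
        orders.add(new Order(3, 222.8));
        orders.add(new Order(3, 33.8));
        return orders;
    }

    static Map<Integer, Double> maxPricePerOrder(List<Order> orders) {
        List<Order> sorted = new ArrayList<>(orders);
        // Sort by order id ascending, then by price descending.
        sorted.sort((a, b) -> {
            int byId = Integer.compare(a.orderId, b.orderId);
            return byId != 0 ? byId : Double.compare(b.price, a.price);
        });
        // Within each order the first record seen is now the most expensive one.
        Map<Integer, Double> result = new LinkedHashMap<>();
        for (Order o : sorted) {
            result.putIfAbsent(o.orderId, o.price);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(maxPricePerOrder(sampleOrders())); // {1=222.8, 2=722.4, 3=222.8}
    }
}
```

In the MapReduce version the sort is done by the framework during the shuffle, and the "first record per order" step is what the reduce method performs once per group.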

3. Write the bean

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class OrderBean implements WritableComparable<OrderBean> {
    /**
    * Order id
    */
    private int orderId;
    /**
    * Price
    */
    private double price;

    /**
    * No-arg constructor: Hadoop needs it to deserialize the bean, and the Mapper uses it
    */
    public OrderBean() {
    }

    public OrderBean(int orderId, double price) {
        this.orderId = orderId;
        this.price = price;
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeInt(orderId);
        dataOutput.writeDouble(price);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        // Fields must be read in the same order they were written
        this.orderId = dataInput.readInt();
        this.price = dataInput.readDouble();
    }

    public int getOrderId() {
        return orderId;
    }

    public void setOrderId(int orderId) {
        this.orderId = orderId;
    }

    public double getPrice() {
        return price;
    }

    public void setPrice(double price) {
        this.price = price;
    }

    @Override
    public int compareTo(OrderBean o) {
        // Primary sort: order id ascending
        int result = Integer.compare(orderId, o.getOrderId());
        if (result == 0) {
            // Secondary sort: price descending, so the most expensive item comes first
            result = Double.compare(o.getPrice(), price);
        }
        return result;
    }

    @Override
    public String toString() {
        return orderId + "\t" + price;
    }
}

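4. Write the Mapper

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class OrderMapper extends Mapper<LongWritable,Text,OrderBean,NullWritable>{
    // Reused for every record to avoid allocating a new bean per line
    OrderBean k = new OrderBean();
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Read one line
        String line = value.toString();
        // Split into fields: order id, product id, amount
        String[] fields = line.split("\t");
        // Fill the bean
        k.setOrderId(Integer.parseInt(fields[0]));
        k.setPrice(Double.parseDouble(fields[2]));
        // Emit the bean as the key; no value is needed
        context.write(k,NullWritable.get());
    }
}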

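5. Write the Partitioner (output partitioning)

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class OrderPartitioner extends Partitioner<OrderBean, NullWritable> {
    @Override
    public int getPartition(OrderBean orderBean, NullWritable nullWritable, int numPartitions) {
        // Mask the sign bit so the result is never negative, then spread order ids over the reducers
        return (orderBean.getOrderId() & Integer.MAX_VALUE) % numPartitions;
    }
}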
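As a quick check of the partition formula, the following plain-Java sketch evaluates the same expression outside Hadoop. The class and method names (OrderPartitionDemo, partitionFor) are made up for illustration; the order ids are those of the sample data and the reducer count of 3 matches the Driver later in this section.

```java
class OrderPartitionDemo {
    // Same arithmetic as the partitioner: clear the sign bit, then take the remainder.
    static int partitionFor(int orderId, int numReduceTasks) {
        return (orderId & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;
        for (int orderId = 1; orderId <= 3; orderId++) {
            // Each of the three sample orders lands in a different partition.
            System.out.println(orderId + " -> " + partitionFor(orderId, numReduceTasks));
        }
        // Prints:
        // 1 -> 1
        // 2 -> 2
        // 3 -> 0
    }
}
```

With three distinct order ids and three reduce tasks every order gets its own output file. With more orders than reduce tasks, several orders would share a partition, and grouping is then what keeps them apart inside the reducer.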

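6. Write the Reducer

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class OrderReducer extends Reducer<OrderBean,NullWritable,OrderBean,NullWritable>{
    @Override
    protected void reduce(OrderBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // Keys in a group arrive sorted by price descending, so the key seen here is the most expensive item
        context.write(key,NullWritable.get());
    }
}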

7. Write the GroupingComparator (secondary sort/grouping: it lets you decide which keys are treated as one group and passed to the Reducer together)

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class OrderGroupingComparator extends WritableComparator {
    /**
    * A constructor is required for grouping: it registers the key class
    * and asks the parent to create key instances (second argument true)
    */
    protected OrderGroupingComparator() {
        super(OrderBean.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        OrderBean aBean = (OrderBean) a;
        OrderBean bBean = (OrderBean) b;
        // Keys with the same order id are treated as the same group; price is ignored
        return Integer.compare(aBean.getOrderId(), bBean.getOrderId());
    }
}
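How the grouping comparator changes what the Reducer sees can be sketched in plain Java. The snippet below is an imitation, not the framework's actual code: GroupingDemo and firstPricePerGroup are hypothetical names, and the input arrays are the sample records already sorted by order id ascending and price descending. A new group starts whenever the comparison with the previous key is non-zero, and only the first price of each group is kept, like the reduce method does.

```java
import java.util.ArrayList;
import java.util.List;

class GroupingDemo {
    // Walks keys that are already sorted and returns the first price of every group.
    // groupByOrderIdOnly = true imitates the grouping comparator (order id only);
    // false imitates grouping by the full key (order id and price).
    static List<Double> firstPricePerGroup(int[] orderIds, double[] prices, boolean groupByOrderIdOnly) {
        List<Double> firsts = new ArrayList<>();
        for (int i = 0; i < orderIds.length; i++) {
            boolean newGroup = i == 0
                    || Integer.compare(orderIds[i], orderIds[i - 1]) != 0
                    || (!groupByOrderIdOnly && Double.compare(prices[i], prices[i - 1]) != 0);
            if (newGroup) {
                firsts.add(prices[i]);
            }
        }
        return firsts;
    }

    public static void main(String[] args) {
        int[] orderIds = {1, 1, 2, 2, 2, 3, 3};
        double[] prices = {222.8, 25.8, 722.4, 522.8, 122.4, 222.8, 33.8};

        // Grouping on order id only: one group per order, led by its highest price.
        System.out.println(firstPricePerGroup(orderIds, prices, true)); // [222.8, 722.4, 222.8]
        // Grouping on the full key: every record is its own group.
        System.out.println(firstPricePerGroup(orderIds, prices, false).size()); // 7
    }
}
```

The second call shows why the job needs a grouping comparator at all: without it, keys that differ in price count as different groups and every record would be written out.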

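8. Write the Driver

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OrderDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Get the configuration and create the job
        Configuration conf=new Configuration();
        Job job = Job.getInstance(conf);
        // Tell Hadoop which jar to load by giving it the driver class
        job.setJarByClass(OrderDriver.class);
        // Set the map and reduce classes
        job.setMapperClass(OrderMapper.class);
        job.setReducerClass(OrderReducer.class);
        // Set the key and value types of the map output
        job.setMapOutputKeyClass(OrderBean.class);
        job.setMapOutputValueClass(NullWritable.class);
        // Set the key and value types of the final output
        job.setOutputKeyClass(OrderBean.class);
        job.setOutputValueClass(NullWritable.class);
        // Set the input and output paths
        FileInputFormat.setInputPaths(job,new Path(args[0]));
        FileOutputFormat.setOutputPath(job,new Path(args[1]));
        // Set the grouping used on the reduce side
        job.setGroupingComparatorClass(OrderGroupingComparator.class);
        // Set the partitioner and the number of reduce tasks
        job.setPartitionerClass(OrderPartitioner.class);
        job.setNumReduceTasks(3);
        // Submit the job and wait for it to finish
        boolean result = job.waitForCompletion(true);
        System.exit(result?0:1);
    }
}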

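IV. Handling small files in practice (custom InputFormat)

1. Requirement

Small files hurt efficiency in both HDFS and MapReduce, yet in practice it is hard to avoid scenarios with large numbers of small files, so a solution is needed. Here we merge multiple small files into a single SequenceFile. The SequenceFile stores many files, with the file path plus file name as the key and the file content as the value.

2. Input data

There are three files: one.txt, two.txt and three.txt. Each file contains multiple lines, and each line contains multiple words separated by \t.

3. Analysis

There are several ways to optimize for small files:

At data collection time, merge small files or small batches of data into large files before uploading them to HDFS.
Before business processing, run a MapReduce program on HDFS that merges the small files.
During MapReduce processing, use CombineTextInputFormat to improve efficiency.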

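4. Implementation

Use a custom InputFormat to handle the small input files.

Define a class that extends FileInputFormat
Override the RecordReader so that it reads one complete file at a time and wraps it as a KV pair
Use SequenceFileOutputFormat on output to write the merged file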
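Before the Hadoop classes, the core idea of "one whole file becomes one KV pair" can be sketched with plain Java file I/O. This is only an illustration under assumptions: WholeFilePackDemo, pack and demo are hypothetical names, the files are created in a temporary local directory rather than on HDFS, and an in-memory map stands in for the SequenceFile.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class WholeFilePackDemo {
    // Reads every file completely and stores it as (path -> content bytes).
    static Map<String, byte[]> pack(List<Path> files) {
        Map<String, byte[]> records = new LinkedHashMap<>();
        for (Path file : files) {
            try {
                records.put(file.toString(), Files.readAllBytes(file));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return records;
    }

    // Creates three small sample files in a temp directory and packs them.
    static Map<String, byte[]> demo() {
        try {
            Path dir = Files.createTempDirectory("small-files");
            List<Path> files = new ArrayList<>();
            for (String name : new String[] {"one.txt", "two.txt", "three.txt"}) {
                Path file = dir.resolve(name);
                Files.write(file, ("hello\t" + name + "\n").getBytes(StandardCharsets.UTF_8));
                files.add(file);
            }
            return pack(files);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // One record per file: the key is the path, the value is the whole content.
        for (Map.Entry<String, byte[]> record : demo().entrySet()) {
            System.out.println(record.getKey() + " : " + record.getValue().length + " bytes");
        }
    }
}
```

In the MapReduce job the same roles are played by the custom RecordReader (reads the whole file into a BytesWritable), the Mapper (supplies the path as the key) and SequenceFileOutputFormat (writes the pairs into one file).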

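5. Custom InputFormat

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WholeFileInputformat extends FileInputFormat<NullWritable, BytesWritable> {
    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        // Create the reader
        WholeRecordReader recordReader = new WholeRecordReader();
        // Initialize it with the split and the task context
        recordReader.initialize(inputSplit,taskAttemptContext);
        return recordReader;
    }

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        // Never split a file: each file must be read as a whole
        return false;
    }
}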

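6. Custom RecordReader

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeRecordReader extends RecordReader<NullWritable, BytesWritable> {
    BytesWritable bytesWritable = new BytesWritable();
    // Becomes true once the single record (the whole file) has been read
    boolean isProcess = false;
    FileSplit split;
    Configuration configuration;

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        // Keep the split and the configuration for later use
        this.split = (FileSplit) inputSplit;
        this.configuration = taskAttemptContext.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        // Read the file in one go; there is exactly one record per file
        if (!isProcess) {
            // Buffer sized to hold the whole file
            byte[] buf = new byte[(int) split.getLength()];
            Path path = split.getPath();
            // Get the file system; it is not closed here because FileSystem instances are cached and shared
            FileSystem fileSystem = path.getFileSystem(configuration);
            FSDataInputStream fis = null;
            try {
                // Open the input stream
                fis = fileSystem.open(path);
                // Copy the stream into the buffer
                IOUtils.readFully(fis, buf, 0, buf.length);
                // Copy the buffer into the output value
                bytesWritable.set(buf, 0, buf.length);
            } finally {
                // A read failure propagates as an IOException instead of being swallowed
                IOUtils.closeStream(fis);
            }
            isProcess = true;
            return true;
        }
        return false;
    }

    @Override
    public NullWritable getCurrentKey() throws IOException, InterruptedException {
        return NullWritable.get();
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return bytesWritable;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return isProcess ? 1 : 0;
    }

    @Override
    public void close() throws IOException {
    }
}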

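7. Write the Mapper

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SequenceFileMapper extends Mapper<NullWritable,BytesWritable,Text,BytesWritable>{
    Text k = new Text();
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Get the split for this task
        FileSplit split = (FileSplit)context.getInputSplit();
        // Use the file's path and name as the output key
        Path path = split.getPath();
        k.set(path.toString());
    }

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context) throws IOException, InterruptedException {
        // Emit (file path, whole file content)
        context.write(k,value);
    }
}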

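8. Write the Reducer

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SequenceFileReducer extends Reducer<Text,BytesWritable,Text,BytesWritable>{
    @Override
    protected void reduce(Text key, Iterable<BytesWritable> values, Context context) throws IOException, InterruptedException {
        // Pass every (path, content) pair through unchanged
        for (BytesWritable value : values) {
            context.write(key,value);
        }
    }
}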

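9. Write the Driver

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SequenceFileDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Get the configuration and create the job
        Configuration conf=new Configuration();
        Job job = Job.getInstance(conf);
        // Tell Hadoop which jar to load by giving it the driver class
        job.setJarByClass(SequenceFileDriver.class);
        // Set the map and reduce classes
        job.setMapperClass(SequenceFileMapper.class);
        job.setReducerClass(SequenceFileReducer.class);
        // Set the InputFormat and OutputFormat
        job.setInputFormatClass(WholeFileInputformat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        // Set the key and value types of the map output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);
        // Set the key and value types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);
        // Set the input and output paths
        FileInputFormat.setInputPaths(job,new Path(args[0]));
        FileOutputFormat.setOutputPath(job,new Path(args[1]));
        // Submit the job and wait for it to finish
        boolean result = job.waitForCompletion(true);
        System.exit(result?0:1);
    }
}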

10. Run the program. Remember to pass the input location and the output location as program arguments (args[0] and args[1]).
