Hadoop: optimizing for too many small files

The problem with large numbers of small files in MapReduce

  Map tasks normally process one block of input at a time (with the default FileInputFormat). The HDFS block size defaults to 64 MB in Hadoop 1 and 128 MB in Hadoop 2. When the input consists of many files far smaller than a block, each map task handles only a tiny amount of data, and the job launches one map task per file, so an excess of small files produces an excess of map tasks.

  Hadoop has a few features that mitigate the problem. One is JVM task reuse: allowing multiple map tasks to run in a single JVM amortizes JVM startup cost (set the mapred.job.reuse.jvm.num.tasks property; the default is 1, and -1 means unlimited reuse), as shown in the sketch below. Another is to combine multiple files into one input split (the old API's MultiFileInputFormat/MultiFileSplit, superseded by CombineFileInputFormat), so that a single map task processes several files.
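A minimal sketch of enabling JVM reuse programmatically, assuming the old mapred API of Hadoop 1.x, where JobConf.setNumTasksToExecutePerJvm writes the mapred.job.reuse.jvm.num.tasks property (note that MRv2 on YARN dropped JVM reuse):

import org.apache.hadoop.mapred.JobConf;

public class JvmReuseConfig {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // -1 lets one JVM run an unlimited number of the job's tasks;
        // this is the programmatic equivalent of setting
        // mapred.job.reuse.jvm.num.tasks to -1.
        conf.setNumTasksToExecutePerJvm(-1);
        // ... configure mapper/reducer and submit the job as usual ...
    }
}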

Sequence Files

  Another workaround is to pack small files into a SequenceFile. The example below shows the basic write and read APIs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.util.ReflectionUtils;

import java.io.IOException;


public class SequenceTest {

    public static final String output_path = "xxx"; // placeholder path; point this at a real file
    private static final String[] DATA = { "a", "b", "c", "d"};

    // Writes a few (Text, IntWritable) records to a SequenceFile at pathStr.
    @SuppressWarnings("deprecation")
    public static void write(String pathStr) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(pathStr);

        // Create a writer for (Text, IntWritable) records at the given path.
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
        Text key = new Text();
        IntWritable value = new IntWritable();
        for(int i = 0; i < DATA.length; i++) {
            key.set(DATA[i]);
            value.set(i);
            // writer.getLength() reports the current byte offset in the file.
            System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
            writer.append(key, value);
        }
        IOUtils.closeStream(writer);
    }

    // Reads every record back; the key/value types come from the file header.
    @SuppressWarnings("deprecation")
    public static void read(String pathStr) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(pathStr);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);

        // Instantiate key/value objects of the types recorded in the file.
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        while (reader.next(key, value)) {
            System.out.printf("%s\t%s\n", key, value);
        }
        IOUtils.closeStream(reader);
    }

    public static void main(String[] args) throws IOException {
        write(output_path);
        read(output_path);
    }
}
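In practice the value stored for each small file is usually its raw contents, keyed by the file name, so the original files stay addressable inside the archive. A sketch of such a packer (SmallFilePacker and pack are hypothetical names; it reuses the same deprecated createWriter API as the example above):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {

    // Packs every plain file directly under srcDir into one SequenceFile,
    // using the file name as key and the raw bytes as value.
    @SuppressWarnings("deprecation")
    public static void pack(String srcDir, String dstFile) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path(dstFile), Text.class, BytesWritable.class);
        try {
            for (FileStatus status : fs.listStatus(new Path(srcDir))) {
                if (status.isDir()) {
                    continue; // skip subdirectories
                }
                byte[] contents = new byte[(int) status.getLen()];
                FSDataInputStream in = fs.open(status.getPath());
                try {
                    in.readFully(contents); // small file, so one buffer is fine
                } finally {
                    in.close();
                }
                writer.append(new Text(status.getPath().getName()),
                        new BytesWritable(contents));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}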

Further reading

http://dongxicheng.org/mapreduce/hdfs-small-files-solution/


Reprinted from blog.csdn.net/qq_33283716/article/details/81188864