HDFS: compressing a large number of small files

Following up on the file compression from the previous post, two problems were encountered:

  1. Some logs are very large, exceeding 100 GB per day, so pulling the data to a local machine over FTP takes too long. That approach is not practical.
  2. Because the logs are collected in real time by Spark Streaming, the data is unevenly distributed: the largest files are around 5 GB while the smallest are only tens of MB. Even after compression, such a large number of small files seriously hurts HDFS read performance.

For this situation, the following approach was taken:

1. Archive

Archive this part of the data first. Archiving can merge a large number of small files without changing the amount of data.
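
For reference, a minimal sketch of the archive step using the standard hadoop archive tool (the archive name and output path below are placeholders, not taken from the original setup):

hadoop archive -archiveName logs.har -p /user/hdfs rsync /user/hdfs/archive
hdfs dfs -ls har:///user/hdfs/archive/logs.har

Listing through the har:// scheme still shows the original files, while the NameNode only has to track the few index and part files that make up the .har itself.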


After the original data is archived, the data is evenly distributed (screenshots of the directory listings before and after archiving omitted).

2. Compression

Because of the large data volume, the FTP approach is not practical, so a program that compresses the files directly on HDFS is used instead.

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

/**
 * Created by wpq on 2019/3/27.
 * Compression test
 */
public class HdfsCompressTest {
    public static void main(String[] args) throws ClassNotFoundException, IOException {
        SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        System.out.println("start time:" + df.format(new Date()));
        Class<?> codecClass = Class.forName("org.apache.hadoop.io.compress.GzipCodec");
        Configuration conf = new Configuration();
        conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
        FileSystem fs = FileSystem.get(conf);
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        // directory whose files are to be compressed
        String inputfile = "/user/hdfs/rsync";
        //String inputfile = "/user/hdfs/wpq/tmp";
        // walk the directory tree recursively
        RemoteIterator<LocatedFileStatus> files = fs.listFiles(new Path(inputfile), true);
        while (files.hasNext()) {
            LocatedFileStatus next = files.next();
            if (next.isFile()) {
                Path path = next.getPath();
                // skip files that are already compressed
                if (!path.toString().endsWith("gz")) {
                    // input stream of the file to compress
                    FSDataInputStream in = fs.open(path);
                    // output file name: original name plus the codec's default extension (.gz)
                    String outfile = path.toString() + codec.getDefaultExtension();
                    FSDataOutputStream outputStream = fs.create(new Path(outfile));
                    CompressionOutputStream out = codec.createOutputStream(outputStream);
                    IOUtils.copyBytes(in, out, 4096, false);
                    in.close();
                    out.close();
                    // remove the uncompressed source file
                    fs.delete(path, true);
                }
            }
        }
        System.out.println("end time:" + df.format(new Date()));
    }
}
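
Running this against /user/hdfs/rsync walks the directory tree recursively; files whose names already end in gz are skipped, and each source file is deleted right after its compressed copy is written, so the program can be re-run after an interruption without redoing finished files.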

3. Verification

For verification, extract 100 records from the 1st and the 15th of each month, together with the total record count for that day, compare them against the compressed output, and delete the source data once the check passes.
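
A minimal sketch of the count check is shown below; the class name, command-line arguments, and the way the expected count is obtained are assumptions for illustration, not part of the original post:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

/**
 * Count check sketch: reads one compressed file back and compares the
 * record count against a count taken before compression.
 */
public class HdfsCompressVerify {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // args[0]: path of the .gz file on HDFS, args[1]: record count taken before compression
        Path gzPath = new Path(args[0]);
        long expected = Long.parseLong(args[1]);
        // pick the codec from the file extension (.gz -> GzipCodec)
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(gzPath);
        long actual = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(codec.createInputStream(fs.open(gzPath))))) {
            while (reader.readLine() != null) {
                actual++;
            }
        }
        System.out.println(gzPath + " expected=" + expected + " actual=" + actual
                + (expected == actual ? " -> ok" : " -> mismatch"));
    }
}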

Origin blog.csdn.net/qq_42264264/article/details/88885740