How Hive solves the small files problem

Causes of small files

Small files in Hive are generated when data is imported into a Hive table, so let's first look at the ways data can be imported into Hive.

1. Insert data directly into the table

insert into table A values (1,'zhangsan',88),(2,'lisi',61);

This method generates a new file for every insert statement, so inserting small amounts of data multiple times produces multiple small files. However, this method is rarely, if ever, used in production.

2. Load data with the LOAD DATA statement

load data local inpath '/export/score.csv' overwrite into table A  -- import a file
load data local inpath '/export/score' overwrite into table A   -- import a folder

When LOAD DATA imports a single file, the Hive table gains one file; when it imports a folder, the number of files in the Hive table equals the number of files in that folder.

3. Load data by query

insert overwrite table A  select s_id,c_name,s_score from B;

This method is commonly used in production environments and is also the easiest way to generate small files.
Inserting data starts a MapReduce job, and the job outputs as many files as there are reducers, so:
number of files = number of ReduceTasks × number of partitions.
For a simple map-only job (no reduce phase):
number of files = number of MapTasks × number of partitions.
Every insert generates at least one file in Hive, because an insert runs at least one MapTask.
For example, a business that needs to synchronize data to Hive every 10 minutes will generate a large number of files.
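
As a hedged illustration of the multiplication above (the reducer count, partition values, and table layout are hypothetical; column names follow the earlier examples), a single insert into a dynamically partitioned table can fan out into many files:

set hive.exec.dynamic.partition.mode=nonstrict;  -- needed to partition by dt alone
set mapreduce.job.reduces=4;
-- Assumed setup: table A is partitioned by dt, and table B holds 3 distinct dt values
insert overwrite table A partition(dt)
select s_id, c_name, s_score, dt from B;
-- Each of the 4 reducers may open one output file per dt value it receives,
-- so this single insert can produce up to 4 x 3 = 12 files.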

The impact of too many small files

  1. First of all, for the underlying HDFS storage: HDFS itself is not suited to storing large numbers of small files. Every file, directory, and block occupies NameNode metadata (on the order of 150 bytes of heap each), so too many small files make the metadata enormous, consume excessive memory, and seriously degrade HDFS performance.
  2. For Hive, at query time each small file is treated as a block and a Map task is started to process it. A Map task's startup and initialization time is far longer than its logical processing time, which wastes a great deal of resources. Moreover, the number of Maps that can execute concurrently is limited.

How to solve too many small files

  1. Use Hive's built-in concatenate command to merge small files automatically.
    Usage:
#For non-partitioned tables
alter table A concatenate;

#For partitioned tables
alter table B partition(day=20201224) concatenate;

Example:
#Insert data into table A

hive (default)> insert into table A values (1,'aa',67),(2,'bb',87);
hive (default)> insert into table A values (3,'cc',67),(4,'dd',87);
hive (default)> insert into table A values (5,'ee',67),(6,'ff',87);

#Execute the above three statements, then there will be three small files under table A. Execute the following statement on the hive command line
#View the number of files under table A

hive (default)> dfs -ls /user/hive/warehouse/A;
Found 3 items
-rwxr-xr-x   3 root supergroup        378 2020-12-24 14:46 /user/hive/warehouse/A/000000_0
-rwxr-xr-x   3 root supergroup        378 2020-12-24 14:47 /user/hive/warehouse/A/000000_0_copy_1
-rwxr-xr-x   3 root supergroup        378 2020-12-24 14:48 /user/hive/warehouse/A/000000_0_copy_2

#You can see that there are three small files, and then use concatenate to merge them

hive (default)> alter table A concatenate;

#Check the number of files under table A again

hive (default)> dfs -ls /user/hive/warehouse/A;
Found 1 items
-rwxr-xr-x   3 root supergroup        778 2020-12-24 14:59 /user/hive/warehouse/A/000000_0

#Merge into one file
Note:
1. The concatenate command only supports RCFILE and ORC file types.
2. When using the concatenate command to merge small files, you cannot specify the number of merged files, but you can execute the command multiple times.
3. When concatenate is run multiple times and the number of files no longer decreases, this is related to the parameter mapreduce.input.fileinputformat.split.minsize=256mb, which sets the minimum size of each merged file.
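
A minimal sketch of working around that limit (the 512MB value is an arbitrary assumption): raise the minimum split size before re-running concatenate so larger merges become possible.

#Assumed value; raise the minimum split size (in bytes), then merge again
set mapreduce.input.fileinputformat.split.minsize=536870912;  -- 512M
alter table A concatenate;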

2. Adjust parameters to reduce the number of Maps

• Set parameters related to merging small files on the map input side:

#Combine small files before the Map phase executes
#CombineHiveInputFormat is backed by Hadoop's CombineFileInputFormat,
#which combines multiple files into a single split as one mapper's input
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; -- default
#Maximum input size per Map (this value determines the number of merged files)
set mapred.max.split.size=256000000;  -- 256M
#Minimum split size on a node (this value determines whether files on multiple DataNodes are merged)
set mapred.min.split.size.per.node=100000000;  -- 100M
#Minimum split size under a rack (this value determines whether files on multiple racks are merged)
set mapred.min.split.size.per.rack=100000000;  -- 100M

• Set parameters for merging map output and reduce output:

#Merge small files at the end of a map-only task; the default is true
set hive.merge.mapfiles = true;
#Merge small files at the end of a map-reduce task; the default is false
set hive.merge.mapredfiles = true;
#Target size of the merged files
set hive.merge.size.per.task = 256000000;  -- 256M
#When the average size of output files is below this value, start an independent MapReduce job to merge them
set hive.merge.smallfiles.avgsize=16000000;  -- 16M
• Enable compression

#Whether Hive's query result output is compressed
set hive.exec.compress.output=true;
#Whether the output of the MapReduce job is compressed
set mapreduce.output.fileoutputformat.compress=true;
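
Putting these parameters together, here is a minimal sketch (a common compaction pattern, not from the original post) that compacts an existing table by rewriting it onto itself; with the merge settings active, the rewrite triggers an extra merge job that leaves fewer, larger files:

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set hive.merge.mapfiles = true;
set hive.merge.mapredfiles = true;
set hive.merge.size.per.task = 256000000;
set hive.merge.smallfiles.avgsize = 16000000;
-- Rewriting the table in place compacts the small files it contains
insert overwrite table A select * from A;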

3. Reduce the number of reducers

#The number of reducers determines the number of output files, so you can
#control the file count of a Hive table by adjusting the reducer count.
#Hive's distribute by clause maps directly to the partitioning step in MR,
#so setting the reducer count and using distribute by together lets the data
#flow evenly into each reducer.
#There are two ways to set the number of reducers. The first is to set the count directly:
set mapreduce.job.reduces=10;
#The second is to set the data size each reducer handles; Hive then estimates
#the reducer count from the total data size:
set hive.exec.reducers.bytes.per.reducer=5120000000; -- the default is 1G; here set to 5G
#Execute the following statement to distribute the data evenly across the reducers:
set mapreduce.job.reduces=10;
insert overwrite table A partition(dt)
select * from B
distribute by rand();
Explanation: with the reducer count set to 10, rand() generates a random number x and each row is sent to reducer x % 10, so rows land in the reducers at random, preventing any output file from being too large or too small.
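
A variant worth knowing (an assumption on my part, not from the original post; dt is the dynamic partition column): distributing by the partition key instead of rand() sends all rows of one partition to the same reducer, yielding exactly one file per partition.

set mapreduce.job.reduces=10;
insert overwrite table A partition(dt)
select * from B
distribute by dt;
-- Trade-off: one file per partition, but a skewed dt can overload a single reducer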

4. Use Hadoop's archive to archive small files

Hadoop Archive, abbreviated HAR, is a file archiving tool that packs small files into HDFS blocks efficiently. It can package multiple small files into a single HAR file, which reduces the NameNode's memory usage while still allowing transparent access to the files.
#Controls whether archiving is enabled
set hive.archive.enabled=true;
#Tells Hive whether the parent directory can be set when creating an archive
set hive.archive.har.parentdir.settable=true;
#Controls the size of the archive's part files
set har.partfile.size=1099511627776;
#Use the following command to archive
ALTER TABLE A ARCHIVE PARTITION(dt='2020-12-24', hr='12');
#Restore the archived partition to the original file
ALTER TABLE A UNARCHIVE PARTITION(dt='2020-12-24', hr='12');
Note:
An archived partition can still be queried, but insert overwrite is not allowed; it must be unarchived first.
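
A short sketch of that restriction in practice (column names follow the earlier examples and are assumptions):

-- Reads still work against the archived partition
select count(*) from A where dt='2020-12-24' and hr='12';
-- To rewrite it, unarchive first, then overwrite
ALTER TABLE A UNARCHIVE PARTITION(dt='2020-12-24', hr='12');
insert overwrite table A partition(dt='2020-12-24', hr='12')
select s_id, c_name, s_score from B;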
Finally, if it is a new cluster with no historical baggage, it is recommended to use the ORC file format for Hive tables and enable LZO compression.
If small files do accumulate, you can use Hive's built-in concatenate command to merge them quickly.
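
As a minimal sketch of that recommendation (the schema is hypothetical, mirroring the earlier examples), new tables can be declared as ORC, which also makes them eligible for concatenate:

create table A_orc (
  s_id int,
  c_name string,
  s_score int
)
stored as orc;
insert overwrite table A_orc select s_id, c_name, s_score from B;
alter table A_orc concatenate;  -- works because the table is ORC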

Merge small files before the Map phase:
set mapred.max.split.size=2048000000
#Maximum input size per Map, set to 2GB (in bytes)
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
#Merge small files before executing Map
Merge on output:
set hive.merge.mapfiles = true
#Merge small files at the end of a map-only task
set hive.merge.mapredfiles= true
#Merge small files at the end of a map-reduce task
set hive.merge.size.per.task = 1024000000
#Merged files will be about 1GB
set hive.merge.smallfiles.avgsize=1024000000
#When the average size of output files is below 1GB, start an independent map-reduce job to merge them
If the output files need to be compressed, a compression codec must be added. There are two compression targets (compressing the final output and compressing intermediate results) and several codecs; choose according to your needs. I needed gzip, so I chose GzipCodec; BZip2Codec, SnappyCodec, and LzopCodec are also available.
Compress files:
set hive.exec.compress.output=true;
#Default false; whether to compress the output results
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
#Compression codec setting
set mapred.output.compression.type=BLOCK;
#There are three compression types (NONE, RECORD, BLOCK); BLOCK has the highest compression ratio and is generally used.



Origin blog.csdn.net/m0_37759590/article/details/132510799