A high-frequency data warehouse interview question: solving the problem of too many small files in Hive

This article was first published on the public account: Learn Big Data in Five Minutes

Reasons for small files

Small files in Hive are created when data is imported into a Hive table, so let's first look at the common ways of importing data into Hive.

  1. Insert data directly into the table
insert into table A values (1,'zhangsan',88),(2,'lisi',61);

Every such insert produces a new file, so inserting small batches of data many times results in many small files. However, this method is rarely, if ever, used in production.

  2. Load data with the load command
load data local inpath '/export/score.csv' overwrite into table A  -- load a single file

load data local inpath '/export/score' overwrite into table A   -- load a directory

The load command can import either a file or a directory. Importing a file adds exactly one file to the Hive table; importing a directory adds as many files as the directory contains.

  3. Load data with a query
insert overwrite table A  select s_id,c_name,s_score from B;

This method is the most common one in production, and it is also the one that most easily generates small files.

An insert of this kind starts an MR job, and the job writes as many output files as it has reducers.

Therefore, the number of files = the number of Reduce Tasks * the number of partitions

Many simple queries have no reduce phase at all, only a map phase, in which case:

Number of files = number of MapTask * number of partitions

Every insert generates at least one file in Hive, because the job has at least one MapTask.
For example, a pipeline that synchronizes data into Hive every 10 minutes will accumulate a large number of files over time.
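As a quick worked example of the formulas above (the reducer count, partition count, and scheduling interval below are made-up numbers, only for illustration):

-- Hypothetical job: 5 reducers writing into 4 dynamic partitions of dt
set hive.exec.dynamic.partition.mode=nonstrict;
set mapreduce.job.reduces=5;
insert overwrite table A partition(dt)
select s_id, c_name, s_score, dt from B;

-- Files per run: up to 5 reducers * 4 partitions = 20 files
-- Scheduled every 10 minutes: up to 20 * 6 * 24 = 2,880 new files per day from this one job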

The impact of too many small files

  1. For the underlying HDFS storage: HDFS itself is not well suited to storing large numbers of small files. Too many small files bloat the NameNode metadata, consume excessive memory, and seriously degrade HDFS performance (a rough estimate of the memory cost follows this list).
  2. For Hive: at query time, each small file is treated as a block and handled by its own Map task, and a Map task's startup and initialization time is far longer than its actual processing time, which wastes a great deal of resources. Moreover, the number of Map tasks that can run concurrently is limited.
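A rough back-of-the-envelope estimate of that NameNode memory cost (the ~150 bytes per metadata object is the commonly cited approximation, not a figure from the original article):

# Each file and each block is held as an in-memory object on the NameNode, roughly 150 bytes each.
# 10,000,000 small files, each smaller than one block:
#   ~10,000,000 file objects + ~10,000,000 block objects = ~20,000,000 objects
#   ~20,000,000 * 150 B ≈ 3 GB of NameNode heap, no matter how little data the files actually hold.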

How to solve the problem of too many small files

1. Use Hive's built-in concatenate command to merge small files

Usage:

#For a non-partitioned table
alter table A concatenate;

#For a partitioned table
alter table B partition(day=20201224) concatenate;

For example:

#Insert data into table A
hive (default)> insert into table A values (1,'aa',67),(2,'bb',87);
hive (default)> insert into table A values (3,'cc',67),(4,'dd',87);
hive (default)> insert into table A values (5,'ee',67),(6,'ff',87);

#After running the three statements above, table A contains three small files.
#Check the number of files under table A from the hive CLI
hive (default)> dfs -ls /user/hive/warehouse/A;
Found 3 items
-rwxr-xr-x   3 root supergroup        378 2020-12-24 14:46 /user/hive/warehouse/A/000000_0
-rwxr-xr-x   3 root supergroup        378 2020-12-24 14:47 /user/hive/warehouse/A/000000_0_copy_1
-rwxr-xr-x   3 root supergroup        378 2020-12-24 14:48 /user/hive/warehouse/A/000000_0_copy_2

#There are three small files; merge them with concatenate
hive (default)> alter table A concatenate;

#Check the number of files under table A again
hive (default)> dfs -ls /user/hive/warehouse/A;
Found 1 items
-rwxr-xr-x   3 root supergroup        778 2020-12-24 14:59 /user/hive/warehouse/A/000000_0

#The files have been merged into one

Note:
1. The concatenate command only supports the RCFILE and ORC file formats.
2. When merging small files with concatenate you cannot specify the number of resulting files, but you can run the command multiple times.
3. If the number of files stops decreasing after running concatenate repeatedly, it is because of the parameter mapreduce.input.fileinputformat.split.minsize=256mb, which sets the minimum size of each resulting file; a sketch follows below.
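A minimal sketch of note 3 in practice (the 512 MB value here is just an example, not a recommendation):

#If repeated concatenate runs stop reducing the file count, raise the minimum split size first
set mapreduce.input.fileinputformat.split.minsize=536870912;  -- 512M
alter table A concatenate;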

2. Adjust the parameters to reduce the number of maps

  • Set the parameters that merge small files on the map input side:
#Merge small files before the Map phase
#CombineHiveInputFormat is backed by Hadoop's CombineFileInputFormat,
#which combines multiple files into a single split as the mapper input
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; -- the default

#Maximum input size per Map (this value determines the number of files after merging)
set mapred.max.split.size=256000000;   -- 256M

#Minimum split size on a single node (determines whether files on different DataNodes are merged)
set mapred.min.split.size.per.node=100000000;  -- 100M

#Minimum split size under a single rack/switch (determines whether files under different switches are merged)
set mapred.min.split.size.per.rack=100000000;  -- 100M
  • Set the parameters that merge the map output and the reduce output:
#Merge small files output at the map end, default true
set hive.merge.mapfiles = true;

#Merge small files output at the reduce end, default false
set hive.merge.mapredfiles = true;

#Size of the merged files
set hive.merge.size.per.task = 256*1000*1000;   -- 256M

#When the average output file size is smaller than this value, start a separate MapReduce job to merge the files
set hive.merge.smallfiles.avgsize=16000000;   -- 16M
  • Enable compression (a combined example of all three groups of settings follows this list):
# Whether Hive compresses its query result output
set hive.exec.compress.output=true;

# Whether MapReduce job output is compressed
set mapreduce.output.fileoutputformat.compress=true;
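Putting the three groups of settings above into a single session, a minimal sketch (it reuses tables A and B from the earlier example; the values are the same ones listed above, not hard requirements):

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.max.split.size=256000000;
set mapred.min.split.size.per.node=100000000;
set mapred.min.split.size.per.rack=100000000;

set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=16000000;

set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;

-- With these settings, the small files read from B are combined into larger splits on input,
-- and the files written by this insert are merged (and compressed) on output.
insert overwrite table A select s_id,c_name,s_score from B;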

3. Reduce the number of reducers

#The number of reducers determines the number of output files, so adjusting the reducer count
#controls how many files a Hive table ends up with.
#Hive's distribute by clause controls how rows are assigned to MR partitions (i.e. reducers),
#so by setting the number of reducers and using distribute by, data can be spread evenly across them.

#There are two ways to set the number of reducers. The first is to set the count directly
set mapreduce.job.reduces=10;

#The second is to set the amount of data per reducer; Hive derives the reducer count from the total data size
set hive.exec.reducers.bytes.per.reducer=5120000000;  -- default is 1G, here set to 5G

#Run the following to spread the data evenly across the reducers
set mapreduce.job.reduces=10;
insert overwrite table A partition(dt)
select * from B
distribute by rand();

Explanation: with the reducer count set to 10, rand() produces a random number x and x % 10
decides which reducer each row goes to, so rows land in the reducers evenly and no output
file ends up much larger or smaller than the others.
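A sketch of choosing the reducer count from a target output file size (the 50 GB input size is hypothetical):

-- Suppose table B holds roughly 50 GB and we want output files of about 256 MB:
--   50 GB / 256 MB ≈ 200 reducers
set mapreduce.job.reduces=200;
insert overwrite table A partition(dt)
select * from B
distribute by rand();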

4. Use Hadoop archives (HAR) to archive small files

Hadoop Archive, abbreviated HAR, is a file archiving tool that packs small files into HDFS blocks efficiently. It can bundle many small files into a single HAR file, reducing the NameNode's memory usage while still allowing transparent access to the files.

#Enable archiving
set hive.archive.enabled=true;
#Allow Hive to set the parent directory when creating the archive
set hive.archive.har.parentdir.settable=true;
#Control the size of the archive's part files
set har.partfile.size=1099511627776;

#Archive a partition with the following command
ALTER TABLE A ARCHIVE PARTITION(dt='2020-12-24', hr='12');

#Restore an archived partition back to its original files
ALTER TABLE A UNARCHIVE PARTITION(dt='2020-12-24', hr='12');

Note: an archived partition can still be queried, but it cannot be written with insert overwrite; it must be unarchived first.

Finally

For a new cluster with no historical baggage, it is recommended to use the ORC file format for Hive tables and enable LZO compression.
That way, when too many small files do accumulate, they can be merged quickly with Hive's built-in concatenate command.
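A minimal sketch of such a table definition (the column list is made up; ORC's own codec is set through the orc.compress table property, and whether LZO is available there depends on the Hive/ORC version, so SNAPPY is shown as a safe placeholder):

create table A_orc (
  s_id    int,
  c_name  string,
  s_score int
)
stored as orc
tblproperties ("orc.compress"="SNAPPY");  -- ZLIB is the default; use LZO if your version supports it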

For more big data technical articles, follow the official account Learn Big Data in Five Minutes, which focuses on big data technology and shares high-quality original technical articles.

