Hive small file merge

    Hive warehouse table data is ultimately stored on HDFS, which by design handles large files very efficiently; fewer, larger files also mean less file metadata and therefore less memory pressure on the NameNode. In a data warehouse, however, the higher a table's aggregation level, the smaller its data volume, and since these tables usually carry date partitions, the number of files on HDFS grows steadily over time.

1. Problems caused by small files

  • An HDFS file is made up of data blocks and metadata, where the metadata (location, size, block list, and so on) is held in the NameNode's memory. Each object occupies roughly 150 bytes, so 10 million files, each with one block, amount to about 20 million objects, or roughly 3 GB of memory; once the count approaches this magnitude, NameNode performance starts to degrade.
  • HDFS also spends more time reading and writing small files, because every access has to fetch metadata from the NameNode and then open a connection to the corresponding DataNode. For MapReduce programs, small files inflate the number of Mappers, and each Map task processes only a small amount of data, wasting a great deal of scheduling time.

2. Why Hive produces small files

    The summary tables in a Hive data warehouse usually hold far less data than the source tables, and to speed up jobs we tend to increase the number of Reducers; Hive itself applies a similar optimization: the number of Reducers equals the source data size divided by the value of hive.exec.reducers.bytes.per.reducer (default 1 GB). More Reducers means more result files, which leads to the small-file problem.
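
    A rough illustration of how this parameter drives the number of result files (the 10 GB input size is a hypothetical figure, not from the original article):

-- With about 10 GB of data feeding the reduce stage and the default of 1 GB per reducer,
-- Hive plans roughly 10 reducers, so the query writes roughly 10 result files.
set hive.exec.reducers.bytes.per.reducer=1000000000;
-- Lowering the value to 256 MB would push the plan toward roughly 40 reducers,
-- and therefore roughly 40 smaller output files.
-- set hive.exec.reducers.bytes.per.reducer=256000000;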

    The small-file problem can be tackled from two directions:

  • Input merging: merge small files before the Map phase.
  • Output merging: merge small files when the job writes its results.

3. Configure Map input merging

-- The maximum input size of each Map, which determines the number of merged files
set mapred.max.split.size=256000000;
-- The minimum split size on a single node, which determines whether files on different data nodes need to be merged
set mapred.min.split.size.per.node=100000000;
-- The minimum split size within a single rack (switch), which determines whether files on different racks need to be merged
set mapred.min.split.size.per.rack=100000000;
-- Merge small files before executing Map
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

4. Configure Hive result merging

    The result files can be merged after the job finishes by setting the following Hive configuration items:

  • set hive.merge.mapfiles = true # Merge small files at the end of a map-only task
  • set hive.merge.mapredfiles = true # Merge small files at the end of a map-reduce task
  • set hive.merge.size.per.task = 256*1000*1000 # Target size of the merged files
  • set hive.merge.smallfiles.avgsize = 16000000 # When the average size of the output files is below this value, start an independent map-reduce job to merge them

    When merging result files, Hive runs an extra map-only job whose number of mappers equals the total size of the result files divided by the size.per.task parameter. The merge is triggered only when two conditions hold: the mapfiles/mapredfiles parameter matching the query type is enabled, and the average size of the result files is smaller than the avgsize parameter.

-- map-red job with 5 reducers, producing 5 files of about 60K each.
create table dw_stage.zj_small as
select paid, count(*)
from dw_db.dw_soj_imp_dtl
where log_dt = '2014-04-14'
group by paid;
-- With mapredfiles merging enabled, an additional map-only job runs with one mapper and produces a single 300K file.
set hive.merge.mapredfiles=true;
create table dw_stage.zj_small as
select paid, count(*)
from dw_db.dw_soj_imp_dtl
where log_dt = '2014-04-14'
group by paid;
-- map-only job with 45 mappers, generating 45 files of about 25M each.
create table dw_stage.zj_small as
select *
from dw_db.dw_soj_imp_dtl
where log_dt = '2014-04-14'
and paid like '%baidu%';
-- With avgsize raised to 100M, an additional map-only merge job runs with 4 mappers, producing 4 files of about 250M each.
set hive.merge.smallfiles.avgsize=100000000;
create table dw_stage.zj_small as
select *
from dw_db.dw_soj_imp_dtl
where log_dt = '2014-04-14'
and paid like '%baidu%';

5. Handling compressed output files

    When the output is stored as compressed files, the way the small-file problem is solved matters: merging on the Map input side places no restriction on the storage format of the output, but if output merging is used, the table must be stored as SequenceFile, otherwise the files cannot be merged. An example:

set mapred.output.compression.type=BLOCK;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;
set hive.merge.smallfiles.avgsize=100000000;
drop table if exists dw_stage.zj_small;
create table dw_stage.zj_small
STORED AS SEQUENCEFILE
as select *
from dw_db.dw_soj_imp_dtl
where log_dt = '2014-04-14'
and paid like '%baidu%';

6. Use HAR archives

    Hadoop Archives (HAR) are another way to mitigate the small-file problem, and Hive provides native support for them:

set hive.archive.enabled=true;
set hive.archive.har.parentdir.settable=true;
set har.partfile.size=1099511627776;
ALTER TABLE srcpart ARCHIVE PARTITION(ds='2008-04-08', hr='12');
ALTER TABLE srcpart UNARCHIVE PARTITION(ds='2008-04-08', hr='12');

    If you are not using a partitioned table, you can create an external table and use the har:// protocol to specify the path.
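
    A minimal sketch of that approach, with illustrative paths, table name, and columns that are not from the original article: the table directory is first packed with the hadoop archive command, and an external table then points at the archived data through the har:// scheme.

-- From the shell: pack the table directory into zj_small.har under /user/hive/archives (paths are assumptions)
--   hadoop archive -archiveName zj_small.har -p /user/hive/warehouse/dw_stage.db zj_small /user/hive/archives
-- In Hive: expose the archived data via an external table whose location uses har://
create external table dw_stage.zj_small_har (
  paid string,
  cnt bigint
)
stored as textfile
location 'har:///user/hive/archives/zj_small.har/zj_small';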

 

    Reposted from: http://blog.csdn.net/yycdaizi/article/details/43341239
