[Big Data Hive] 25. HQL syntax optimization: merging small files

1 Optimization overview

  Small-file optimization can be approached from two directions: merging small input files on the Map side, and merging small output files on the Reduce side.

1.1 Merging input files on the Map side

  Merging small input files on the Map side means packing multiple small files into the same split so that a single Map Task processes them, preventing each small file from starting its own Map Task and wasting resources.
Related parameters:

--Combine multiple small files into one split, processed by a single map task
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
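To make the effect concrete, the packing can be sketched as a greedy loop. This is only an illustrative simulation, not Hive's actual split planner (CombineHiveInputFormat also groups files by node and rack locality), and the 256 MB cap is an assumed maximum combined split size:

```python
def combine_splits(file_sizes_mb, max_split_mb=256):
    """Greedily pack small files into combined splits, one Map Task each.
    Illustrative only: the real CombineHiveInputFormat also weighs data
    locality, which this sketch ignores."""
    splits, current = [], 0
    for size in file_sizes_mb:
        # Start a new split when adding this file would exceed the cap
        if current and current + size > max_split_mb:
            splits.append(current)
            current = 0
        current += size
    if current:
        splits.append(current)
    return splits

# 10 files of 30 MB each: 2 combined splits (2 Map Tasks) instead of 10
print(combine_splits([30] * 10))  # [240, 60]
```

With the default one-file-one-split behavior, the same input would start 10 Map Tasks.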

1.2 Merging output files on the Reduce side

  Merging small output files on the Reduce side means combining multiple small files into larger files, reducing the number of small files on HDFS.
Principle:
  Hive checks the average size of the files a computation task outputs; if it falls below a configured threshold, an extra task is started to merge them.
Related parameters:

--Merge small files output by map-only tasks (computation tasks with no reduce stage)
set hive.merge.mapfiles=true;

--Merge small files output by map-reduce tasks
set hive.merge.mapredfiles=true;

--Target size of the merged files
set hive.merge.size.per.task=256000000;

--Threshold that triggers the small-file merge task: if a task's average output file size is below this value, merging is triggered
set hive.merge.smallfiles.avgsize=16000000;
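The trigger logic can be sketched numerically. The function below is a simplified simulation (a hypothetical helper, not Hive's actual merge planner); its defaults mirror the 16 MB average threshold and 256 MB target size set above:

```python
import math

def plan_merge(file_sizes_mb, avg_threshold_mb=16, merge_size_mb=256):
    """If a task's average output file size is below
    hive.merge.smallfiles.avgsize, an extra merge task combines the
    outputs into files of roughly hive.merge.size.per.task each."""
    avg = sum(file_sizes_mb) / len(file_sizes_mb)
    if avg >= avg_threshold_mb:
        return len(file_sizes_mb)               # no merge: file count unchanged
    total_mb = sum(file_sizes_mb)
    return math.ceil(total_mb / merge_size_mb)  # file count after merging

# 5 reduce outputs of 2 MB each: average 2 MB < 16 MB, merged into 1 file
print(plan_merge([2] * 5))  # 1
```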

2 Case study

1. Sample SQL statement

--Compute the total order amount per province; the table below is the result table
drop table if exists order_amount_by_province;
create table order_amount_by_province(
    province_id string comment 'province id',
    order_amount decimal(16,2) comment 'order amount'
)
location '/order_amount_by_province';

insert overwrite table order_amount_by_province
select
    province_id,
    sum(total_amount)
from order_detail
group by province_id;

2. Task parallelism before optimization
  By default, the Reduce-side parallelism of this SQL statement is 5, so the final number of output files is also 5, all of which are small files.

3. Optimization ideas
Scheme 1: reasonably set the Reduce-side parallelism of the task.
  Setting the task's parallelism to 1 guarantees that the output is a single file.

set mapreduce.job.reduces=1;

Scheme 2: enable Hive's small-file merge optimization.
Set the following parameters:

--Merge small files output by map-reduce tasks
set hive.merge.mapredfiles=true;

--Target size of the merged files
set hive.merge.size.per.task=256000000;

--Threshold that triggers the small-file merge task: if a task's average output file size is below this value, merging is triggered
set hive.merge.smallfiles.avgsize=16000000;
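The two schemes can be compared on this case using hypothetical per-file sizes (the article does not report the actual sizes of the 5 output files):

```python
# Hypothetical sizes (MB) of the 5 reduce output files; not from the article
reduce_outputs_mb = [3, 3, 2, 2, 2]

# Scheme 1: force a single reducer -> one output file,
# at the cost of losing all Reduce-side parallelism.
files_scheme1 = 1

# Scheme 2: keep 5 reducers; the average (2.4 MB) is below the 16 MB
# threshold, so the merge task combines the 12 MB of output into 1 file
# (the total is far below the 256 MB per-file target).
avg_mb = sum(reduce_outputs_mb) / len(reduce_outputs_mb)
files_scheme2 = 1 if avg_mb < 16 else len(reduce_outputs_mb)

print(files_scheme1, files_scheme2)  # 1 1
```

Scheme 2 reaches the same file count without sacrificing Reduce-side parallelism during the computation itself.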

Origin blog.csdn.net/qq_18625571/article/details/131214841