[Hive of Big Data] 24. Task Parallelism of HQL Syntax Optimization

1 Optimization instructions

  Hive's computation tasks are executed by MapReduce, so parallelism tuning falls into two parts: the Map side and the Reduce side.

1.1 Map side parallelism

  Map-side parallelism, i.e. the number of map tasks, is determined by the number of splits of the input files. In general, Map-side parallelism does not need to be adjusted manually. In special cases (the queried table contains a large number of small files, or the map side runs complex query logic), manual adjustment can be considered.
1. The queried table contains a large number of small files
  Under Hadoop's default split strategy, each small file gets its own map task. If the queried table contains a large number of small files, a large number of map tasks will be started, wasting resources.
Solution:
  Use the CombineHiveInputFormat provided by Hive to combine multiple small files into one split, thereby controlling the number of map tasks.
Related parameters:

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

2. The map side runs complex query logic
  When the SQL statement contains complex, time-consuming logic such as regular-expression replacement or JSON parsing, Map-side computation will be relatively slow.
Solution:
  If computing resources are sufficient, consider increasing Map-side parallelism so that there are more map tasks and each map task processes less data.
Related parameters:

-- maximum size of a single split
set mapreduce.input.fileinputformat.split.maxsize=256000000;
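The effect of `mapreduce.input.fileinputformat.split.maxsize` on map parallelism can be sketched as follows. This is a deliberately simplified model (one map task per split, splits capped at the maximum split size); the real split computation also considers block size, minimum split size, and per-file boundaries.

```python
import math

def estimated_map_tasks(total_input_bytes: int, split_max_size: int) -> int:
    """Rough estimate of map-task count under FileInputFormat-style
    splitting: one map task per split, each split at most
    split_max_size bytes. A simplification of the real logic."""
    return max(1, math.ceil(total_input_bytes / split_max_size))

# Halving the maximum split size roughly doubles the number of map tasks.
print(estimated_map_tasks(1_000_000_000, 256_000_000))  # 4
print(estimated_map_tasks(1_000_000_000, 128_000_000))  # 8
```

This is why lowering the maximum split size is the lever for increasing Map-side parallelism when map logic is expensive.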

1.2 Reduce parallelism

  Reduce-side parallelism, i.e. the number of reduce tasks, can either be specified by the user or estimated by Hive based on the input size of the MR job.
Related parameters:

-- specify Reduce-side parallelism; the default value -1 means it is not specified by the user
set mapreduce.job.reduces;

-- maximum Reduce-side parallelism
set hive.exec.reducers.max;

-- amount of data processed by a single Reduce task, used to estimate Reduce-side parallelism
set hive.exec.reducers.bytes.per.reducer;

The logic for determining the degree of parallelism on the Reduce side is as follows:
  If the parameter mapreduce.job.reduces is set to a non-negative integer, the Reduce-side parallelism is that value. Otherwise, Hive estimates it itself, with the following logic:
  (1) Let the total input size of the job be totalInputBytes;
  (2) Let the value of hive.exec.reducers.bytes.per.reducer be bytesPerReducer;
  (3) Let the value of hive.exec.reducers.max be maxReducers;
  then the Reduce-side parallelism is:

  numReducers = min(ceil(totalInputBytes / bytesPerReducer), maxReducers)
  When Hive estimates the Reduce parallelism itself, the estimate is based on the input size of the entire MR job, so in some cases it can be inaccurate. In those cases the user needs to specify the Reduce parallelism according to the actual situation.
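The estimation logic above can be sketched in a few lines. This is a simplified model of Hive's behavior, not its actual source code: a non-negative `mapreduce.job.reduces` wins outright, otherwise the estimate is `min(ceil(input / bytesPerReducer), maxReducers)`.

```python
import math

def estimate_reducers(total_input_bytes: int,
                      bytes_per_reducer: int,
                      max_reducers: int,
                      specified: int = -1) -> int:
    """Simplified model of Hive's Reduce-parallelism logic.
    `specified` plays the role of mapreduce.job.reduces
    (default -1, i.e. not set by the user)."""
    if specified >= 0:
        return specified
    return min(math.ceil(total_input_bytes / bytes_per_reducer), max_reducers)

print(estimate_reducers(1136009934, 256000000, 1009))  # 5
```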

2 Case study

1. Sample SQL statement

select
    province_id,
    count(*)
from order_detail
group by province_id;

2. Before optimization
When Reduce parallelism is not specified, Hive estimates it itself with the following inputs:

totalInputBytes=1136009934
bytesPerReducer=256000000
maxReducers=1009

The Reduce parallelism is therefore:

min(ceil(1136009934 / 256000000), 1009) = min(5, 1009) = 5
3. Optimization idea
  Hive performs map-side aggregation by default, so the data received by the Reduce side is actually the partial aggregation results already produced by the map side.
  Observing the task's execution in YARN shows that each map task outputs only 34 records, and there are 5 map tasks in total. That is, the Reduce side actually receives only 170 (34 × 5) records, so theoretically a Reduce-side parallelism of 1 is sufficient. In this case, the user can set the Reduce parallelism to 1 via the following parameter.

-- specify Reduce-side parallelism; the default value -1 means it is not specified by the user
set mapreduce.job.reduces=1;

Origin blog.csdn.net/qq_18625571/article/details/131214751