1 Optimization instructions
Hive's computation is carried out by MapReduce, so parallelism tuning is divided into the Map side and the Reduce side.
1.1 Map side parallelism
Map-side parallelism, i.e. the number of map tasks, is determined by the number of splits of the input files. In general, map-side parallelism does not need to be adjusted manually. In special cases (the queried table contains a large number of small files, or the map side runs complex query logic), manual adjustment can be considered.
1. The queried table contains a large number of small files
Under Hadoop's default slicing strategy, each small file independently starts a map task to handle its computation. If the queried table contains a large number of small files, a large number of map tasks will be started, wasting resources.
Solution:
Use the CombineHiveInputFormat provided by Hive to combine multiple small files into one split, thereby controlling the number of map tasks.
Related parameters:
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
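The effect of combining can be illustrated with a small sketch. This is not Hive's actual slicing code, just a greedy packing model (an assumption for illustration) of how many splits, and hence map tasks, result when small files share splits:

```python
# Illustrative sketch (not Hive source code): greedily pack small files
# into combined splits no larger than max_split_size, the way
# CombineHiveInputFormat avoids one split per small file.
def combine_splits(file_sizes, max_split_size):
    splits, current, current_size = [], [], 0
    for size in file_sizes:
        # Flush the current split if adding this file would exceed the limit.
        if current and current_size + size > max_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits

# 1000 small files of 1 MB each with a 256 MB split limit: the default
# strategy would start 1000 map tasks; combining yields only 4 splits.
splits = combine_splits([1_000_000] * 1000, 256_000_000)
print(len(splits))  # 4
```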
2. The map side has complex query logic
When the SQL statement contains complex, time-consuming logic such as regular-expression replacement or JSON parsing, computation on the map side will be relatively slow.
Solution:
If computing resources are sufficient, consider increasing map-side parallelism so that more map tasks are started and each map task processes less data.
Related parameters:
--maximum size of a single split
set mapreduce.input.fileinputformat.split.maxsize=256000000;
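As a rough model (an assumption, not Hadoop's exact slicing logic, which also considers block boundaries), the number of map tasks for a single splittable file grows as the split maximum shrinks:

```python
import math

# Rough sketch: for one splittable file, the number of splits (and map
# tasks) is approximately the file size divided by the split maximum.
def estimated_map_tasks(file_size, split_maxsize):
    return math.ceil(file_size / split_maxsize)

# A 1 GB input file:
print(estimated_map_tasks(1_073_741_824, 256_000_000))  # 5 map tasks
print(estimated_map_tasks(1_073_741_824, 64_000_000))   # 17 map tasks
```

Lowering split.maxsize from 256 MB to 64 MB roughly quadruples the number of map tasks, so each task processes less data.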
1.2 Reduce parallelism
Reduce-side parallelism, i.e. the number of reduce tasks, can be specified by the user or estimated by Hive based on the size of the MR job's input files.
Related parameters:
--specify Reduce-side parallelism; the default value is -1, meaning not specified by the user
set mapreduce.job.reduces;
--maximum Reduce-side parallelism
set hive.exec.reducers.max;
--amount of data processed by a single Reduce Task, used to estimate Reduce parallelism
set hive.exec.reducers.bytes.per.reducer;
The logic for determining the degree of parallelism on the Reduce side is as follows:
If the parameter mapreduce.job.reduces is set to a non-negative integer, the Reduce parallelism is that value. Otherwise, Hive estimates the Reduce parallelism itself, with the following logic:
(1) Let the total size of the job's input files be totalInputBytes;
(2) Let the value of the parameter hive.exec.reducers.bytes.per.reducer be bytesPerReducer;
(3) Let the value of the parameter hive.exec.reducers.max be maxReducers;
then the Reduce-side parallelism is:
min(ceil(totalInputBytes / bytesPerReducer), maxReducers)
When Hive estimates Reduce parallelism itself, it bases the estimate on the input size of the entire MR job, so in some cases the estimate may be inaccurate. The user then needs to specify the Reduce parallelism according to the actual situation.
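The estimation logic described above can be sketched in a few lines (a simplified model of Hive's estimator, not its actual source code):

```python
import math

# Sketch of Hive's Reduce-parallelism estimation: one reducer per
# bytesPerReducer of input, at least 1 and at most maxReducers.
def estimate_reducers(total_input_bytes, bytes_per_reducer, max_reducers):
    estimated = math.ceil(total_input_bytes / bytes_per_reducer)
    return min(max(estimated, 1), max_reducers)

# With roughly 1.1 GB of input and the default 256 MB per reducer:
print(estimate_reducers(1_136_009_934, 256_000_000, 1009))  # 5
```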
2 Case study
1. Sample SQL statement
select
province_id,
count(*)
from order_detail
group by province_id;
2. Before optimization
When the Reduce parallelism is not specified, Hive estimates it itself as follows:
totalInputBytes=1136009934
bytesPerReducer=256000000
maxReducers=1009
The Reduce parallelism is therefore:
min(ceil(1136009934 / 256000000), 1009) = min(5, 1009) = 5
3. Optimization idea
Map-side aggregation is enabled by default in Hive, which means the data received by the reduce side is actually the result already aggregated on the map side.
Observing the task execution in Yarn, you will find that each map task outputs only 34 records, and there are 5 map tasks in total. That is, the reduce side will actually receive only 170 (34 × 5) records, so in theory a Reduce parallelism of 1 is enough. In this case, the user can set the Reduce parallelism to 1 with the following parameter.
--specify Reduce-side parallelism; the default value is -1, meaning not specified by the user
set mapreduce.job.reduces=1;
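Why 170 records suffice for a single reducer can be seen in a toy model of map-side aggregation (the data below is invented for illustration; the 5 tasks and 34 provinces match the case above):

```python
from collections import Counter

# Toy illustration of map-side aggregation for "count(*) group by province_id":
# each map task pre-aggregates its own rows per province, so the reduce side
# receives one partial count per (map task, province) rather than raw rows.
def map_side_aggregate(rows):
    return Counter(rows)  # province_id -> partial count

def reduce_side_merge(partials):
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# 5 map tasks, each seeing 100 rows for each of 34 provinces (invented sizes):
map_outputs = [
    map_side_aggregate([pid for pid in range(34) for _ in range(100)])
    for _ in range(5)
]
records_to_reduce = sum(len(p) for p in map_outputs)
print(records_to_reduce)  # 170 = 34 * 5, trivial for a single reducer
final = reduce_side_merge(map_outputs)
```

A single reduce task merging 170 pre-aggregated records is far cheaper than one merging the raw rows, which is why parallelism 1 is sufficient here.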