Hive execution statement optimization

1. SQL-like statement optimization

1. Optimization principles that are basically the same as in SQL

1.1 Try to atomize operations as much as possible

Avoid packing complex logic into a single SQL statement; use intermediate tables to break the complex logic into simpler steps.
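A minimal sketch of the idea (orders, users and the tmp_ table are hypothetical names):

-- step 1: materialize the complex part into an intermediate table
CREATE TABLE tmp_user_amount AS
SELECT user_id, SUM(amount) AS total_amount
FROM orders
WHERE dt = '2023-01-01'
GROUP BY user_id;

-- step 2: a simple join against the intermediate result
SELECT u.user_name, t.total_amount
FROM tmp_user_amount t
JOIN users u ON t.user_id = u.user_id;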

1.2 Filter data as early as possible

Filter with where before joining, to reduce the amount of data in each stage. For partitioned tables, add partition conditions, and select only the columns that are actually needed.
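A minimal sketch (a and b are hypothetical tables partitioned by dt):

-- filter and prune columns inside each sub-query before the join
SELECT x.user_id, x.amount, y.city
FROM (SELECT user_id, amount FROM a WHERE dt = '2023-01-01') x
JOIN (SELECT user_id, city   FROM b WHERE dt = '2023-01-01') y
  ON x.user_id = y.user_id;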

1.3 Try to use the same join key when joining tables

When joining three or more tables, only one MapReduce job is generated if every on clause uses the same join key.
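A minimal sketch (a, b and c are hypothetical tables):

SELECT a.val, b.val, c.val
FROM a
JOIN b ON a.key = b.key
JOIN c ON a.key = c.key;   -- same key in every ON clause, so one MapReduce job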

2. Optimization principles that differ from SQL

2.1 Small table joins large table

When joining tables, put the large table last (that is, join the small table to the large table, the opposite of the usual SQL habit). Hive assumes that the last table in the query is the large one: it caches the other tables and streams through the last table.
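A minimal sketch (dim_city is a hypothetical small table, fact_log a hypothetical large one):

SELECT c.city_name, COUNT(*) AS pv
FROM dim_city c            -- small table first: it gets cached
JOIN fact_log f            -- large table last: it gets streamed
  ON c.city_id = f.city_id
GROUP BY c.city_name;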

2.2 Replace union all with insert into

If a union all has more than two parts, or each part involves a large amount of data, it should be split into multiple insert into statements. In practical tests this can greatly reduce execution time.
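A minimal sketch (result, t1, t2 and t3 are hypothetical tables with the same schema):

-- instead of one INSERT ... SELECT ... UNION ALL ... UNION ALL ...
INSERT INTO TABLE result SELECT id, amount FROM t1;
INSERT INTO TABLE result SELECT id, amount FROM t2;
INSERT INTO TABLE result SELECT id, amount FROM t3;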

2.3 Try to replace order by with sort by

order by: sorts the query result globally, which takes a long time. In strict mode it also requires a limit clause, unless hive.mapred.mode=nonstrict is set.
sort by: sorts locally within each reducer rather than globally, which is more efficient.
Because order by uses a single reducer for the whole result, it is relatively inefficient. sort by can be used instead, usually combined with distribute by, which serves as the reduce partition key.
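A minimal sketch (user_log, user_id and ts are hypothetical):

SELECT user_id, ts, action
FROM user_log
DISTRIBUTE BY user_id   -- user_id decides which reducer a row goes to
SORT BY user_id, ts;    -- rows are sorted only within each reducer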

2.4 Make LIMIT statements return results quickly

Normally a LIMIT statement still executes the whole query and then returns a partial result.
There is a configuration property that avoids this by sampling the data source:
hive.limit.optimize.enable=true — enable sampling of the data source
hive.limit.row.max.size — set the minimum sample size
hive.limit.optimize.limit.file — set the maximum number of files to sample
Disadvantage: some input data may never be processed.
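A minimal sketch of turning this on (the numeric values are only illustrative, and big_table is a hypothetical table):

set hive.limit.optimize.enable=true;
set hive.limit.row.max.size=100000;
set hive.limit.optimize.limit.file=10;
SELECT * FROM big_table LIMIT 100;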

2. Data skew optimization

1. Phenomenon description

Data skew is a problem frequently encountered in Hive. It shows up as follows: the task progress stays at 99% (or 100%) for a long time, and the task monitoring page shows that only a small number (one or a few) of reduce subtasks have not finished, because the amount of data those reducers process differs greatly from the others, often by a factor of three or more.

2. Causes

1) Uneven distribution of keys
2) Characteristics of the business data itself
3) Poor choices when designing the tables
4) Some SQL statements are inherently prone to data skew

Keyword | Situation | Consequence
join | One of the tables is small, but its keys are concentrated | The data sent to one or a few reducers is far above the average
join | Large table joined to large table, with too many 0 or NULL values in the join key field | All the NULL/0 keys are handled by a single reducer, which is very slow
group by | The GROUP BY dimension has too few distinct values, and one value appears very many times | The reducer handling that value takes a long time
count distinct | Too many rows carry the same special value | The reducer handling that special value takes a long time

3. Solutions

3.1 Common handling methods

First, set hive.groupby.skewindata=true. This makes Hive run the query as two MR jobs. In the first job, map output is distributed to the reducers randomly, and each reducer performs a partial aggregation and writes its result; rows with the same GROUP BY key can therefore land on different reducers, which balances the load. The second job then distributes the pre-aggregated results by GROUP BY key and completes the final aggregation. This reduces the skew caused by a few keys carrying far more rows than the rest.
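A minimal sketch (skewed_table and its key column are hypothetical):

set hive.groupby.skewindata=true;
-- this group by on a skewed key is now executed as two MR jobs
SELECT key, COUNT(*) AS cnt
FROM skewed_table
GROUP BY key;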

Second, set hive.map.aggr=true (the default) to do a combiner-style aggregation on the map side. If the map-side rows are almost all distinct, this aggregation achieves nothing and the combiner is just overhead. Hive handles this case as well, via hive.groupby.mapaggr.checkinterval=100000 (default) and hive.map.aggr.hash.min.reduction=0.5 (default): after the first 100,000 rows have been examined, if the number of distinct keys / 100,000 > 0.5, map-side aggregation is abandoned.
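Stated as settings (these are the defaults, written out only for illustration):

set hive.map.aggr=true;
set hive.groupby.mapaggr.checkinterval=100000;
set hive.map.aggr.hash.min.reduction=0.5;
-- after the first 100000 rows, if distinct keys / 100000 > 0.5,
-- Hive gives up on map-side aggregation for this query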

3.2 SQL Statement Adjustment

When a large table is joined to a large table:
turn the NULL join keys into a string plus a random number, so that the skewed rows are spread across different reducers. Since NULL keys cannot match anything, this does not affect the final result.
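A hedged sketch, assuming string join keys and that real keys never look like the generated placeholder (big_a and big_b are hypothetical tables):

SELECT a.id, b.val
FROM big_a a
LEFT JOIN big_b b
  ON (CASE WHEN a.key IS NULL
           THEN concat('null_', cast(rand() AS string))  -- spread NULL keys across reducers
           ELSE a.key END) = b.key;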

When count distinct has a large number of the same special values:
Handle the case where the value is empty separately within the count distinct. If only the count distinct itself is needed, the empty values can simply be filtered out and 1 added to the final result. If other aggregations also require a group by, first process the records with empty values separately, then union the result with the other calculations.
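A minimal sketch of the simple case (t and uid are hypothetical; the + 1 counts the empty value as one distinct value, following the rule above):

SELECT COUNT(DISTINCT uid) + 1 AS uv
FROM t
WHERE uid IS NOT NULL AND uid <> '';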

The group by dimension is too small:
use sum() with group by to replace count(distinct) for the calculation.
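A minimal sketch of the rewrite (t, dt and uid are hypothetical):

-- instead of SELECT dt, COUNT(DISTINCT uid) FROM t GROUP BY dt
SELECT dt, SUM(1) AS uv
FROM (SELECT dt, uid FROM t GROUP BY dt, uid) x   -- deduplicate first
GROUP BY dt;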

Special handling for special cases:
When optimizing the business logic brings little benefit, the skewed data can sometimes be taken out and processed separately, and then unioned back with the rest at the end.
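A minimal sketch (t and the hot key value 'hot_key' are hypothetical):

-- aggregate everything except the skewed key
SELECT key, COUNT(*) AS cnt FROM t WHERE key <> 'hot_key' GROUP BY key
UNION ALL
-- handle the skewed key on its own, then union the result back
SELECT 'hot_key' AS key, COUNT(*) AS cnt FROM t WHERE key = 'hot_key';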

3. Mapper and reducer stage optimization

1. Map stage optimization

The number of maps is mainly determined by: the total number of input files, the size of the input files, and the file block size configured on the cluster (128 MB by default, not customizable at query time). Split sizes are further controlled by three configuration parameters:

mapred.min.split.size.per.node  minimum split size on a single node
mapred.min.split.size.per.rack  minimum split size within a single rack
mapred.max.split.size  maximum size of a split

For example, if there are a large number of small files (smaller than 128 MB), many maps will be generated. To reduce the number of maps:

set mapred.max.split.size=100000000; 
set mapred.min.split.size.per.node=100000000; 
set mapred.min.split.size.per.rack=100000000;  
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; -- merge small files before execution

The first three parameters determine the size of the merged splits. Input larger than the 128 MB block size is split at 128 MB boundaries; pieces between 100 MB and 128 MB are split at 100 MB; and anything smaller than 100 MB (including genuine small files and the leftover fragments of split large files) is merged together. To increase the number of maps instead, adjust the same parameters in the opposite direction, as sketched below.
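A sketch with smaller, purely illustrative split sizes to produce more maps:

set mapred.max.split.size=50000000;
set mapred.min.split.size.per.node=50000000;
set mapred.min.split.size.per.rack=50000000;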

2. Reduce phase optimization

The number of reducers is determined by the following three parameters:

mapred.reduce.tasks  force a specific number of reduce tasks
hive.exec.reducers.bytes.per.reducer  amount of data processed by each reduce task, default 1000^3 = 1 GB
hive.exec.reducers.max  maximum number of reducers per job, default 999

In general, Hive uses an estimation function to calculate the number of reducers automatically from the total input size. The formula is simple: N = min(hive.exec.reducers.max, total input size / hive.exec.reducers.bytes.per.reducer).
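A quick worked example with the defaults: for roughly 9 GB of input, N = min(999, 9 GB / 1 GB) = 9 reducers. To override the estimate explicitly:

set mapred.reduce.tasks=15;   -- illustrative value; forces 15 reducers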

PS: I have no personal experience with this last area of optimization; the problems encountered so far have mainly involved the first two sections.


4. Learning reference materials

1. https://www.cnblogs.com/sandbank/p/6408762.html
2. https://www.cnblogs.com/xd502djj/p/3799432.html
