Hive SQL Optimization Ideas

Hive optimization mainly falls into three areas: configuration optimization, SQL statement optimization, and task optimization. Of these, the one developers deal with most in day-to-day work is SQL optimization.

The core ideas of optimization are:

  • Reduce the data volume (e.g. partition pruning, column pruning; see the sketch after this list)

  • Avoid data skew (e.g. by tuning parameters or breaking up hot keys)

  • Avoid full table scans (e.g. by adding a partition filter to the on condition)

  • Reduce the number of jobs (e.g. joins sharing the same on condition can be combined into one job)
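As a minimal illustration of the first point, the sketch below reads only the columns it needs from a single partition; the table test and column uid appear in examples later in this article, while event_time is a hypothetical column standing in for whatever columns you actually need:

select uid, event_time      -- column pruning: name the columns instead of select *
from test
where ds = '2020-08-10'     -- partition pruning: scan one partition, not the whole table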

HQL statement optimization

1. Use partition pruning and column pruning

On partition pruning: in an outer join, if the filter condition on the secondary table (the joined table) is written in the where clause, the join runs against the whole table first and the filter is applied afterwards.

select a.*  
from a  
left join b on  a.uid = b.uid  
where a.ds='2020-08-10'  
and b.ds='2020-08-10'

The SQL above makes two mistakes:

  1. The where condition on the secondary table (table b above) is written after the join, so the whole table is joined first and the partition filter applied afterwards.

Note: although the where condition on table a is also written after the join, the predicate is pushed down for table a, i.e. its where condition is executed before the join; for table b, however, the predicate is not pushed down!

  2. The on condition does not filter out null values. If both tables contain a large number of nulls in the join key, the data will be skewed.

The correct way to write it:

select a.*  
from a  
left join b on (a.uid is not null and a.uid = b.uid and b.ds='2020-08-10') 
where a.ds='2020-08-10'

If rows with a null uid are also required, the key needs to be transformed in the on condition, or those rows taken out separately:

select a.*  
from a  
left join b on (a.uid is not null and a.uid = b.uid and b.ds='2020-08-10')  
where a.ds='2020-08-10' and a.uid is not null  
union all  
select a.* from a where a.uid is null and a.ds='2020-08-10'

or:

select a.*  
from a  
left join b on   
case when a.uid is null then concat("test",RAND()) else a.uid end = b.uid and b.ds='2020-08-10'  
where a.ds='2020-08-10'

or (subquery):

select a.*  
from a  
left join   
(select uid from b where ds = '2020-08-10' and uid is not null) b on a.uid = b.uid 
where a.uid is not null  
and a.ds='2020-08-10'
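To confirm that the partition predicate on b is applied at scan time rather than after the join, you can inspect the query plan with explain; the TableScan for b should then show the ds filter:

explain
select a.*
from a
left join b on (a.uid is not null and a.uid = b.uid and b.ds='2020-08-10')
where a.ds='2020-08-10'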

2. Try not to use COUNT DISTINCT

Because a COUNT DISTINCT is completed by a single Reduce Task, that one reducer has to process far too much data, which makes the whole job hard to finish. In general, replace COUNT DISTINCT with a GROUP BY followed by a COUNT. This takes one more job to complete, but with large data volumes it is definitely worth it.

select count(distinct uid)  
from test  
where ds='2020-08-10' and uid is not null  

can be rewritten as:

select count(a.uid)  
from   
(select uid 
 from test 
 where uid is not null and ds = '2020-08-10' 
 group by uid
) a

3. Use with as

Besides the shuffle generated by joins, another thing that slows Hive queries down is subqueries, so minimize the subqueries in a SQL statement. with as extracts a subquery ahead of time (similar to a temporary table) so that every part of the query can reuse its result. Using with as avoids having Hive recompute the same subquery in different parts of the statement.

select a.*  
from  a  
left join b on  a.uid = b.uid  
where a.ds='2020-08-10'  
and b.ds='2020-08-10'  

can be transformed into:

with test1 as 
(
select uid  
from b  
where ds = '2020-08-10' and uid is not null  
)  
select a.*  
from a  
left join test1 on a.uid = test1.uid  
where a.ds='2020-08-10' and a.uid is not null

4. Joins between large and small tables

There is a rule of thumb for queries with joins: the table or subquery with fewer rows should be placed on the left side of the join operator. The reason is that in the Reduce phase of a join, the contents of the table on the left side are loaded into memory, so putting the smaller table on the left reduces the chance of OOM errors. That said, newer versions of Hive optimize both small-table-JOIN-large-table and large-table-JOIN-small-table, so there is no longer an obvious difference between putting the small table on the left or the right. Still, placing the small table first can appropriately reduce the amount of data moved during the join and improve efficiency.
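A minimal sketch of this rule, using hypothetical table names (dim_user as the small table, fact_log as the large one); on newer Hive versions, enabling hive.auto.convert.join lets Hive turn such a query into a map-side join automatically:

set hive.auto.convert.join = true;  -- let Hive map-join the small table automatically

select s.uid, l.*
from dim_user s                     -- smaller table on the left
join fact_log l on s.uid = l.uid
where l.ds='2020-08-10'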

5. Data skew

The principle of data skew is well known: one or a few keys account for, say, 90% of the data, so processing those keys drags down the whole task; at the same time, rows with the same key being pulled together on one reducer can cause a memory overflow.

Data skew only happens during a shuffle. Commonly used operators that may trigger a shuffle include: distinct, groupByKey, reduceByKey, aggregateByKey, join, cogroup, repartition, and so on. When data skew occurs, it is usually because one of these operators is used in the code.

General approaches to handling data skew in Hive:

The common practice is parameter tuning:

set hive.map.aggr=true;  
set hive.groupby.skewindata = true;

When this option is set to true, the generated query plan contains two MapReduce jobs.

In the first job, the map output is distributed randomly across the reducers, and each reducer performs a partial aggregation and emits its result.

Because rows with the same Group By Key may be sent to different reducers, the load is balanced across them.

The second job then distributes the pre-aggregated results to the reducers by Group By Key (which guarantees that rows with the same key reach the same reducer) and completes the final aggregation.

But this solution is a black box for us and cannot be controlled.
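For comparison, the same two-stage idea can be written by hand, which keeps it under your control; a sketch using the test table from the earlier examples, where a random salt in 0-9 first spreads each uid's partial counts across up to ten groups:

select uid, sum(cnt) as pv            -- stage 2: merge the partial counts per uid
from (
  select uid, salt, count(*) as cnt   -- stage 1: partial aggregation per (uid, salt)
  from (
    select uid, cast(floor(rand() * 10) as int) as salt
    from test
    where ds = '2020-08-10'
  ) s
  group by uid, salt
) t
group by uid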

The general solution is to break up the corresponding key values.

For example:

select a.*  
from a  
left join b on  a.uid = b.uid  
where a.ds='2020-08-10'  
and b.ds='2020-08-10'  

If 90% of the keys are null, data skew will inevitably occur.

select a.uid  
from test1 as a  
join(  
   select case when uid is null then cast(rand() * 1000000 as int)  
   else uid end as uid  
   from test2 where ds='2020-08-10') b   
on a.uid = b.uid  
where a.ds='2020-08-10'  

Of course, this is only a theoretical solution.

In practice, the normal solution is simply to filter out the nulls; but in day-to-day work the skewed key is usually not a special value like null.

So how should this kind of data skew be handled for everyday needs (see the sketch after this list):

  1. Sample the data to find which keys are the hot ones;

  2. Add a random number (salt) to those keys according to some rule;

  3. Perform the join; because the hot keys have been broken up, the skew is avoided;

  4. In the result, strip the previously added random numbers to recover the original keys.
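A sketch of this salting pattern, with hypothetical tables big_a and big_b joined on uid; for brevity it salts every key with a random suffix 0-9 rather than only the sampled hot keys, and replicates the other side across all ten suffixes so that every row still finds its matches:

select split(a.salted_uid, '_')[0] as uid   -- step 4: strip the salt afterwards
from (
  -- step 2: salt one side with a random suffix 0-9
  select concat(cast(uid as string), '_', cast(floor(rand() * 10) as string)) as salted_uid
  from big_a
  where ds = '2020-08-10'
) a
join (
  -- replicate the other side once per possible suffix
  select concat(cast(uid as string), '_', n.i) as salted_uid
  from big_b
  lateral view explode(split('0,1,2,3,4,5,6,7,8,9', ',')) n as i
  where ds = '2020-08-10'
) b
on a.salted_uid = b.salted_uid              -- step 3: join on the salted key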

Of course, the optimizations above target the SQL itself, and some tuning is done through parameter settings; those are not described in detail here.

But the core ideas of optimization are the same:

  1. Reduce the amount of data

  2. Avoid data skew

  3. Reduce the number of jobs

  4. A higher-level point: optimize the overall business implementation according to the business logic;

  5. A higher-level solution: use dedicated query engines such as Presto or Impala, or replace MR/Tez with the Spark compute engine

