[Big Data: Hive] 23. HQL Syntax Optimization: Data Skew

1 Overview of Data Skew

  Data skew refers to an uneven distribution of the data involved in a computation: the amount of data for one or a few keys far exceeds that of the other keys. During the shuffle phase, the large volume of data sharing a key is sent to the same Reduce, whose running time then far exceeds that of the other Reduces and becomes the bottleneck of the whole task.
  In Hive, data skew typically occurs in group aggregation and join operations.

2 Data skew caused by group aggregation

2.1 Optimization instructions

  Unoptimized group aggregation in Hive is implemented as a single MapReduce job. The Map side reads the data, partitions it by the grouping field, and sends it to the Reduce side through Shuffle; the final aggregation of each group is completed on the Reduce side.
  If the values of the group by field are unevenly distributed, a large number of rows with the same key may end up in the same Reduce, causing data skew.

Solution ideas:
Map-side aggregation and Skew-GroupBy optimization.

1. Map-Side aggregation

  With map-side aggregation enabled, part of the aggregation is done on the Map side. The input is divided into splits of roughly equal size, and each map task first aggregates the rows within its own split.

  Even if the source data is skewed, the data sent to Reduce after this preliminary aggregation is far less skewed. In the best case, map-side aggregation hides the data skew problem completely.
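
The effect can be illustrated with a small Python simulation. This is only a sketch of the idea, not Hive code: the function names, the modulo partitioner, and the sample data are all assumptions made for illustration. It counts how many records each reducer receives for a `count(*) ... group by key` query, with and without pre-aggregation inside each split.

```python
from collections import Counter

def reducer_for(key, num_reducers):
    # Hypothetical stand-in for the shuffle partitioner: key modulo reducer count.
    return key % num_reducers

def shuffle_load(rows, num_mappers, num_reducers, map_aggr):
    """Return how many records each reducer receives from the shuffle."""
    split_size = -(-len(rows) // num_mappers)  # ceiling division: equal-size splits
    load = [0] * num_reducers
    for start in range(0, len(rows), split_size):
        split = rows[start:start + split_size]
        if map_aggr:
            # The mapper pre-aggregates its split and emits one
            # (key, partial_count) record per distinct key.
            emitted = Counter(split).keys()
        else:
            # Without map-side aggregation every input row is shuffled.
            emitted = split
        for key in emitted:
            load[reducer_for(key, num_reducers)] += 1
    return load

# Skewed input: key 1 has 9,000 rows, keys 2..11 have 100 rows each.
rows = [1] * 9000 + [k for k in range(2, 12) for _ in range(100)]

print(shuffle_load(rows, num_mappers=4, num_reducers=3, map_aggr=False))  # [300, 9300, 400]
print(shuffle_load(rows, num_mappers=4, num_reducers=3, map_aggr=True))   # [3, 7, 4]
```

In this toy input the hot key still lands on one reducer, but that reducer now receives a handful of partial counts instead of thousands of rows.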

Related parameters:

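-- Enable map-side aggregation
set hive.map.aggr=true;

-- Used to check whether the source table's data is suitable for map-side aggregation.
-- The check aggregates a sample of rows on the map side first: if the ratio of the row
-- count after aggregation to the row count before aggregation is below this value, the
-- table is considered suitable; otherwise it is considered unsuitable, and the remaining
-- rows are no longer aggregated on the map side.
set hive.map.aggr.hash.min.reduction=0.5;

-- Number of rows sampled for the suitability check.
set hive.groupby.mapaggr.checkinterval=100000;

-- Maximum fraction of the map task's heap memory that the map-side aggregation
-- hash table may use; when it is exceeded, the hash table is flushed once.
set hive.map.aggr.hash.force.flush.memory.threshold=0.9;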
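
The suitability check described in the comments above can be modeled in a few lines of Python. This is a simplified sketch based only on that description, not Hive's actual implementation; the function name `should_keep_map_aggr` is hypothetical, and its two keyword arguments mirror `hive.groupby.mapaggr.checkinterval` and `hive.map.aggr.hash.min.reduction`.

```python
def should_keep_map_aggr(keys, check_interval=100000, min_reduction=0.5):
    """Decide, from a sample, whether map-side aggregation is worth continuing."""
    sample = keys[:check_interval]          # only the first check_interval rows are examined
    if not sample:
        return True
    rows_after = len(set(sample))           # one row per distinct group after aggregation
    ratio = rows_after / len(sample)        # rows after aggregation / rows before
    return ratio < min_reduction            # a small ratio means aggregation pays off

# Few distinct keys: aggregation shrinks the data a lot, so keep it.
print(should_keep_map_aggr([1] * 90 + [2] * 10))
# Every key is unique: aggregation saves nothing, so give it up.
print(should_keep_map_aggr(list(range(100))))
```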

2. Skew-GroupBy optimization

  Skew-GroupBy is specially designed to deal with data skew due to group by.

Principle:
  Two MR jobs are started. The first partitions the data by a random number, so the rows are scattered across the Reduces, which perform a partial aggregation. The second reads the output of the first job's Reduce side and partitions it by the grouping field to complete the final aggregation.

Related parameters:

-- Enable skew optimization for group aggregation
set hive.groupby.skewindata=true;
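
The two-job principle can be sketched as a Python simulation. It is an illustration under stated assumptions rather than Hive's implementation: the function name `skew_groupby_count`, the seeded random generator, and the modulo partitioner in the second stage are all hypothetical.

```python
import random
from collections import Counter

def skew_groupby_count(rows, num_reducers, seed=0):
    """Simulate count(*) ... group by key with a scatter stage and a merge stage."""
    rng = random.Random(seed)

    # Job 1: each row goes to a random reducer, which builds partial counts.
    stage1 = [Counter() for _ in range(num_reducers)]
    stage1_load = [0] * num_reducers
    for key in rows:
        r = rng.randrange(num_reducers)
        stage1[r][key] += 1
        stage1_load[r] += 1

    # Job 2: partial counts are partitioned by the grouping key and merged.
    stage2 = [Counter() for _ in range(num_reducers)]
    stage2_load = [0] * num_reducers
    for partial in stage1:
        for key, count in partial.items():
            r = key % num_reducers
            stage2[r][key] += count
            stage2_load[r] += 1

    final = Counter()
    for counts in stage2:
        final.update(counts)
    return final, stage1_load, stage2_load

rows = [1] * 9000 + [k for k in range(2, 12) for _ in range(100)]
final, stage1_load, stage2_load = skew_groupby_count(rows, num_reducers=3)
print(stage1_load)   # roughly even, despite the hot key
print(stage2_load)   # only a few partial records per reducer
```

The first stage sees all the rows but spreads them evenly; the second stage partitions by key but only has to merge at most one partial count per key from each first-stage reducer.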

2.2 Case

1. Sample SQL statement

select
    province_id,
    count(*)
from order_detail
group by province_id;

2. Before optimization
  The province_id field in the order_detail table is skewed. Without optimization, the skew can be seen by watching the reduce tasks of the job in YARN.
  Map-side aggregation is enabled by default in Hive, so to observe the skew you first need to set hive.map.aggr to false.
3. Optimization idea
(1) Map-Side aggregation
Setting parameters:

-- Enable map-side aggregation
set hive.map.aggr=true;
-- Disable skew-groupby
set hive.groupby.skewindata=false;

[Figure: execution plan with map-side aggregation enabled]
  Watching the reduce tasks in YARN shows that, with map-side aggregation enabled, the data reaching reduce is no longer skewed.
(2) Skew-GroupBy optimization
Setting parameters:

-- Enable skew-groupby
set hive.groupby.skewindata=true;
-- Disable map-side aggregation
set hive.map.aggr=false;

  With Skew-GroupBy enabled, the SQL starts two MR jobs on YARN: the first scatters the data, and the second groups and aggregates the scattered data.

2.3 Summary

  Map-side aggregation is generally preferable to Skew-GroupBy: use map-side aggregation whenever it is able to remove the skew.

  Map-side aggregation has to maintain a hash table on the Map side, and that hash table consumes memory. It still works when memory is tight, because the hash table is flushed whenever it exceeds the configured memory threshold, but it becomes less effective.

  When memory is small, the hash table is flushed many times, and map-side aggregation may no longer solve the data skew problem. In that case use Skew-GroupBy, which works whether or not memory is sufficient: it scatters the data first and aggregates afterwards.
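
Why repeated flushes weaken map-side aggregation can be sketched in Python. This is a toy model, not Hive's behavior in detail: Hive flushes on a heap-memory threshold, whereas the hypothetical `records_emitted` function below uses a maximum number of hash table entries as a stand-in for limited memory.

```python
def records_emitted(split, max_entries):
    """Count the records one mapper sends to the shuffle for a given hash table capacity."""
    table = {}
    emitted = 0
    for key in split:
        if key not in table and len(table) >= max_entries:
            emitted += len(table)   # table is full: flush all partial results
            table.clear()
        table[key] = table.get(key, 0) + 1
    return emitted + len(table)     # final flush at the end of the split

# Hot key 0 interleaved with distinct keys 1..4.
split = [0, 1, 0, 2, 0, 3, 0, 4]
print(records_emitted(split, max_entries=100))  # 5: one record per distinct key
print(records_emitted(split, max_entries=2))    # 7: key 0 is emitted again after each flush
```

With a tiny table the hot key is emitted almost as often as it appears in the input, so the reducer that owns it is overloaded again.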

3 Data skew caused by Join

3.1 Optimization instructions

  An unoptimized join uses the common join algorithm by default, that is, the computation is completed by a single MapReduce job. The Map side reads the data of the tables being joined, partitions it by the join fields, and sends it to the Reduce side through Shuffle; rows with the same key are joined on the Reduce side.

  If the values of the join fields are unevenly distributed, a large number of rows with the same key may end up in the same Reduce, causing data skew.

  There are three solutions to data skew caused by join: map join, skew join, and adjusting the SQL statement.

1. Map join

  With map join, the join is completed entirely on the map side: there is no shuffle and no reduce phase, so naturally no data skew can arise on the reduce side. This solution suits skew that occurs when a large table is joined to a small table.

Principle:
  Every Mapper caches the small table's data. The large table is divided into splits of roughly equal size (splitting depends only on size, not on the key), and each map task processes one split: it walks through the rows one by one, looks each up in the cached small table to join it, and outputs the result. Each map task therefore handles about the same amount of data, and the data skew problem is solved.
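
The principle can be sketched as a Python simulation. It is a minimal illustration, not Hive code: the function name `map_join` and the sample rows are assumptions, and the small table is modeled as an in-memory dictionary shared by all mappers.

```python
def map_join(big_rows, small_rows, num_mappers):
    """Join each split of the big table against an in-memory copy of the small table."""
    # Every mapper holds the whole small table as a hash table keyed by the join field.
    small_by_id = {}
    for row_id, value in small_rows:
        small_by_id.setdefault(row_id, []).append(value)

    # The big table is cut into equal-size splits, regardless of key.
    split_size = max(1, -(-len(big_rows) // num_mappers))
    joined, rows_per_mapper = [], []
    for start in range(0, len(big_rows), split_size):
        split = big_rows[start:start + split_size]
        rows_per_mapper.append(len(split))
        for key, payload in split:
            for small_value in small_by_id.get(key, []):
                joined.append((key, payload, small_value))
    return joined, rows_per_mapper

# Key 1 is heavily skewed in the big table; key 4 has no match in the small table.
big = [(1, f"order{i}") for i in range(9)] + [(2, "order9"), (3, "order10"), (4, "order11")]
small = [(1, "Beijing"), (2, "Shanghai"), (3, "Tianjin")]

joined, rows_per_mapper = map_join(big, small, num_mappers=4)
print(rows_per_mapper)  # [3, 3, 3, 3]
print(len(joined))      # 11
```

The load per mapper depends only on the split size, so the skew on key 1 has no effect on how the work is distributed.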

Related parameters:

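-- Enable automatic map join conversion
set hive.auto.convert.join=true;

-- Condition for converting a Common Join operator into a Map Join operator:
-- if, among the tables of the Common Join, some n-1 tables have a total size <= this value,
-- a Map Join plan is generated. Several combinations of n-1 tables may satisfy the
-- condition, in which case Hive generates one Map Join plan for each of them and also
-- keeps the original Common Join plan as a backup. At run time the Map Join plan is
-- tried first; if it cannot complete, the Common Join backup plan is started.
set hive.mapjoin.smalltable.filesize=250000;

-- Enable unconditional conversion to Map Join
set hive.auto.convert.join.noconditionaltask=true;

-- Threshold on the total size of the small tables for unconditional conversion: if,
-- among the tables of a Common Join operator, some n-1 tables have a total size <= this
-- value, Hive no longer generates a Map Join plan for every combination of n-1 tables
-- and does not keep the Common Join as a backup; it generates only one optimal Map Join plan.
set hive.auto.convert.join.noconditionaltask.size=10000000;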
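
The decision described in these comments can be written out as a small Python sketch. It follows only the description above and is not Hive's planner: the function names are hypothetical, table sizes are plain byte counts, and real Hive applies further constraints (for example on join types) that are ignored here.

```python
def big_table_candidates(table_sizes, threshold):
    """Indexes of tables that could be streamed while the other n-1 tables are cached."""
    total = sum(table_sizes)
    return [i for i, size in enumerate(table_sizes) if total - size <= threshold]

def choose_join_plan(table_sizes, smalltable_filesize=250000,
                     noconditionaltask=True, noconditionaltask_size=10000000):
    # Unconditional conversion: one map join plan, no common join backup.
    if noconditionaltask and big_table_candidates(table_sizes, noconditionaltask_size):
        return "map join only"
    # Conditional conversion: one map join plan per qualifying combination, plus a backup.
    candidates = big_table_candidates(table_sizes, smalltable_filesize)
    if candidates:
        return f"{len(candidates)} map join plan(s) + common join backup"
    return "common join"

print(choose_join_plan([5_000_000_000, 2_000]))
print(choose_join_plan([5_000_000_000, 2_000], noconditionaltask=False))
print(choose_join_plan([5_000_000_000, 4_000_000_000]))
```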

2. Skew join

  Skew join addresses data skew in joins between two large tables.

Principle:
  A separate map join task is started for the skewed large keys, while the remaining keys go through a normal common join.
Related parameters:

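-- Enable skew join optimization
set hive.optimize.skewjoin=true;
-- Threshold that triggers skew join: if the number of rows for a key exceeds this value,
-- skew join is triggered (the check is based on row count)
set hive.skewjoin.key=100000;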

  This scheme places no requirement on the size of the source tables in the join, but it does on the skewed keys: the rows with the skewed keys must be relatively few in one of the two tables, so that they can be loaded for the map join.
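
The split into a common join plus a follow-up map join can be sketched in Python. This is a simplified model under stated assumptions, not Hive's implementation: the function names are hypothetical, skew is detected by counting rows per key in the left table only, and both joins are modeled with the same in-memory hash join.

```python
from collections import Counter, defaultdict

def hash_join(left, right):
    """Plain inner join on the first field of each row."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in right_by_key.get(key, [])]

def skew_join(left, right, skew_threshold):
    # Keys whose row count exceeds the threshold are treated as skewed
    # (skew_threshold plays the role of hive.skewjoin.key).
    key_counts = Counter(key for key, _ in left)
    skewed = {k for k, n in key_counts.items() if n > skew_threshold}

    # Common join for the normal keys.
    normal_part = hash_join([r for r in left if r[0] not in skewed],
                            [r for r in right if r[0] not in skewed])
    # Follow-up map join for the skewed keys only.
    skewed_part = hash_join([r for r in left if r[0] in skewed],
                            [r for r in right if r[0] in skewed])
    return normal_part + skewed_part, skewed

left = [(1001, i) for i in range(50)] + [(1002, 0), (1003, 1)]
right = [(1001, "a"), (1002, "b"), (1004, "c")]
result, skewed = skew_join(left, right, skew_threshold=10)
print(skewed)       # {1001}
print(len(result))  # 51
```

The combined output matches a plain join of the two tables; only the way the work is divided changes.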

3. Adjust the SQL statement

  If both tables in the join are large and the data in one of them is skewed, you can adjust the SQL statement instead.
  Assume the original SQL statement is as follows, where A and B are both large tables and one of them is skewed:

select
    *
from A
join B
on A.id=B.id;

[Figure: common join of A and B]
  In the figure, 1001 is a heavily skewed key, and all of its rows are sent to the same Reduce for processing.

Adjust the SQL statement as follows:

select
    *
from(
    select -- scatter: append a random suffix, 0 or 1
        concat(id,'_',cast(rand()*2 as int)) id,
        value
    from A
)ta
join(
    select -- expand: replicate every row once per suffix
        concat(id,'_',0) id,
        value
    from B
    union all
    select
        concat(id,'_',1) id,
        value
    from B
)tb
on ta.id=tb.id;

[Figure: execution plan of the adjusted SQL statement]
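
The scatter-and-expand rewrite can be checked with a small Python simulation. It is a sketch under stated assumptions, not Hive code: the function name `salted_join`, the seeded random generator, and the reducer assignment formula are hypothetical. Setting `num_salts=2` corresponds to the suffixes 0 and 1 in the SQL above.

```python
import random
from collections import Counter, defaultdict

def salted_join(a_rows, b_rows, num_salts=2, num_reducers=2, seed=0):
    """Join A and B on a salted key and report how many A rows each reducer handles."""
    rng = random.Random(seed)
    # Scatter: every A row gets a random salt appended to its key.
    salted_a = [((key, rng.randrange(num_salts)), value) for key, value in a_rows]
    # Expand: every B row is replicated once per possible salt.
    salted_b = [((key, salt), value) for key, value in b_rows for salt in range(num_salts)]

    b_by_key = defaultdict(list)
    for salted_key, value in salted_b:
        b_by_key[salted_key].append(value)

    load = Counter()
    joined = []
    for (key, salt), a_value in salted_a:
        # Hypothetical partitioner: the salt moves rows of one key to different reducers.
        load[(key * num_salts + salt) % num_reducers] += 1
        for b_value in b_by_key.get((key, salt), []):
            joined.append((key, a_value, b_value))
    return joined, load

a = [(1001, i) for i in range(1000)] + [(1002, 0), (1003, 0)]
b = [(1001, "x"), (1002, "y"), (1003, "z")]

joined, load = salted_join(a, b, num_salts=2)
print(len(joined), dict(load))     # same row count as the plain join, load split in two
_, unsalted_load = salted_join(a, b, num_salts=1)
print(dict(unsalted_load))         # nearly all rows on one reducer
```

Because each A row matches exactly one of the replicated copies of each B row, the join returns the same number of rows as before, while the hot key is spread over as many reducers as there are salt values. The price is that table B grows by a factor of `num_salts`.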

3.2 Case

1. Sample SQL statement

select
    *
from order_detail od
join province_info pi
on od.province_id=pi.id;

2. Before optimization

-- Disable automatic map join conversion
set hive.auto.convert.join=false;

-- Disable skew join optimization (off by default)
set hive.optimize.skewjoin=false;

  The province_id field in the order_detail table is skewed. Without optimization, the skew can be seen by watching the reduce tasks in YARN.
  Automatic map join conversion is enabled by default in Hive, so to observe the skew you first need to set hive.auto.convert.join to false.
3. Optimization ideas
(1) Map join
Setting parameters:

-- Enable map join
set hive.auto.convert.join=true;
-- Disable skew join
set hive.optimize.skewjoin=false;

  With map join enabled, the MR job has only a map phase and no reduce phase, so no data skew occurs.
(2) skew join
Setting parameters:

-- Enable skew join
set hive.optimize.skewjoin=true;
-- Disable map join
set hive.auto.convert.join=false;

After enabling skew join, use explain to view the execution plan.
  When skew join takes effect, the plan contains both a common join and a map join for the skewed keys. The SQL ultimately starts two MR jobs on YARN, and the second job has only a map phase and no reduce phase, which indicates that it performs the map join on the skewed keys.

Origin blog.csdn.net/qq_18625571/article/details/131197840