Hive: Preventing Data Skew

Reference: http://www.cnblogs.com/end/archive/2012/06/19/2554582.html (well written)

1 Reasons for data skew
1.1 Operations:

Keyword | Situation | Consequence
Join | One of the tables is small, but its keys are concentrated | The data distributed to one or a few reducers is far above the average
Join | Large table joined with large table, but the join key contains too many 0 or NULL values | All the NULL values are processed by a single reducer, which is very slow
group by | The group-by dimension is too small and some value appears far too often | The reducer that processes that value takes a long time
count distinct | Too many rows share a special value | The reducer that processes the special value takes a long time

1.2 Causes:

1) Uneven key distribution

2) Characteristics of the business data itself

3) Poor consideration when designing the table

4) Some SQL statements are inherently prone to data skew

1.3 Symptoms:

The task progress stays at 99% (or 100%) for a long time. The job monitoring page shows that only a few (one or several) reduce subtasks have not finished, because the amount of data they process differs greatly from that of the other reducers.

The record count of the slow reducer differs from the average by a large factor, usually 3x or more, and its duration is far longer than the average duration.

2 Solutions for data skew

2.1 Parameter tuning:

hive.map.aggr = true

Enables map-side partial aggregation, equivalent to a Combiner.

hive.groupby.skewindata = true

Performs load balancing when the data is skewed. When this option is set to true, the generated query plan contains two MR jobs. In the first MR job, the map output is randomly distributed among the reducers; each reducer performs a partial aggregation and emits its result. Because rows with the same Group By key may be sent to different reducers, the load is balanced. The second MR job then distributes the pre-aggregated results to the reducers by the Group By key (which guarantees that rows with the same Group By key go to the same reducer) and completes the final aggregation.
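
For example, a minimal sketch of setting these parameters in a Hive session (the query itself is a hypothetical illustration):

set hive.map.aggr = true;
set hive.groupby.skewindata = true;

-- a group by that would otherwise pile hot user_ids onto a few reducers
select user_id, count(1) as pv
  from log
  group by user_id;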

2.2 SQL statement adjustments:

How to join:

When choosing the driving table, pick the table whose join key is most evenly distributed.

Apply column pruning and filtering early, so that the amount of data is relatively small when the two tables are joined.
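
A hedged sketch of column pruning and filtering before a join (the column names dt, event, and user_name are hypothetical):

select a.user_id, a.event, b.user_name
  from ( select user_id, event from log where dt = '2012-06-19' ) a
  join ( select user_id, user_name from users ) b
  on a.user_id = b.user_id;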

Small table joined with large table:

Use map join to load the small dimension table (fewer than 1000 records) into memory in advance and complete the join on the map side, so no reduce step is needed.
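
A minimal sketch of the map join hint (the table names fact_log and dim_small are hypothetical); newer Hive versions can also convert such joins automatically when hive.auto.convert.join is enabled:

select /*+ mapjoin(b) */ a.*, b.dim_name
  from fact_log a
  join dim_small b
  on a.dim_id = b.dim_id;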

Large table joined with large table:

Turn the NULL keys into strings with random numbers appended, so that the skewed data is spread across different reducers. Since the NULL keys cannot match anything, this does not affect the final result (see Solution 2 in section 3.1 for a concrete example).

count distinct with a large number of rows sharing a special value:

When using count distinct, handle the empty values separately. If only count distinct is being computed, the empty values can be filtered out directly and 1 added to the final result. If there are other computations that require a group by, process the records with empty values separately first, and then union them with the other results.
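
For example, a minimal sketch assuming a log table whose user_id column has a large number of empty values:

-- count the distinct non-empty user_ids, then add 1 for the empty value itself
select count(distinct user_id) + 1
  from log
  where user_id is not null and user_id <> '';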

The group by dimension is too small:

Replace count(distinct) with sum() over a group by to complete the calculation.
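
A hedged sketch of the rewrite, assuming the same hypothetical log table: deduplicate with group by in a subquery, then count with sum().

-- instead of: select count(distinct user_id) from log;
select sum(1)
  from ( select user_id from log group by user_id ) t;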

Special handling for special cases:

When optimizing the business logic does not help much, the skewed data can sometimes be taken out and processed separately, then unioned back with the rest, as sketched below.
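
A minimal sketch, assuming a single known hot key (the literal 'hot_key' is hypothetical): the skewed key is handled with a map join on a one-row slice of users, while the remaining keys go through an ordinary join.

-- the skewed key: only one users row matches, so a map join is cheap
select /*+ mapjoin(b) */ a.*, b.*
  from ( select * from log where user_id = 'hot_key' ) a
  join ( select * from users where user_id = 'hot_key' ) b
  on a.user_id = b.user_id
union all
-- the remaining keys go through a normal shuffle join
select a.*, b.*
  from log a
  join users b
  on a.user_id = b.user_id
  where a.user_id <> 'hot_key';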

3 Typical business scenarios

3.1 Data skew caused by null values

Scenario: Logs often suffer from missing information, such as an empty user_id. If the log's user_id is joined with the user_id in the users table, data skew occurs.

Solution 1: Rows whose user_id is empty do not participate in the join. For the UNION ALL below to be valid, both branches must project the same columns; the NULL padding assumes a hypothetical users(user_id, user_name) schema.

select a.*, b.user_id, b.user_name
  from log a
  join users b
  on a.user_id is not null
  and a.user_id = b.user_id
union all
select a.*, null, null  -- NULL padding for the users columns
  from log a
  where a.user_id is null;

Solution 2: Assign a new key to the NULL values.

select *
  from log a
  left outer join users b
  on case when a.user_id is null then concat('hive', rand()) else a.user_id end = b.user_id;

Conclusion: Method 2 is more efficient than method 1: it does less IO and uses fewer jobs. Solution 1 reads log twice and runs 2 jobs, while solution 2 runs only 1 job. This optimization is suitable for skew caused by invalid ids (such as -99, '', null). By turning the NULL key into a string plus a random number, the skewed data is spread across different reducers, which resolves the skew.

3.2 Data skew caused by joining different data types

Scenario: The user_id field in the users table is int, while the user_id field in the log table contains both string and int values. When the two tables are joined on user_id, the default hash partitioning is done on the int value, so all records with string ids end up on a single reducer.

Solution: Convert the numeric type to a string type

select * from users a
  left outer join logs b
  on cast(a.user_id as string) = b.user_id;

3.3 When the small table is not so small: using map join to resolve skew

Map join solves the data skew problem when a small table (few records) is joined with a large table, and it is used very frequently. But if the small table is quite large, the map join can hit bugs or exceptions, and special handling is required. For example:

select * from log a
  left outer join users b
  on a.user_id = b.user_id;

The users table has 6 million+ records. Distributing users to every map is a considerable cost, and map join does not support a "small" table this large. An ordinary join, however, runs into the data skew problem.

Solution:

select /*+mapjoin(x)*/ * from log a
  left outer join (
    select  /*+mapjoin(c)*/ d.*
      from ( select distinct user_id from log ) c
      join users d
      on c.user_id = d.user_id
    ) x
  on a.user_id = x.user_id;

However, if log contains millions of distinct user_ids, we are back to the original map join problem. Fortunately, the daily active members are not that many: members with transactions, with clicks, with commissions, and so on are limited. So this method solves the data skew problem in many scenarios.

4 Summary

Making the map output distribute more evenly across the reducers is our ultimate goal. Due to the limitations of hash algorithms, hashing by key will more or less cause data skew. A great deal of experience shows that data skew is caused by oversights in table design or by business logic that could have been avoided. A fairly general procedure is:

1. Sample the log table to find which user_ids are skewed, producing a result table tmp1 (a sampling sketch follows this list). Since the computing framework knows nothing about the data distribution when the data arrives, sampling is indispensable.

2. Data distribution follows the statistical rules of society: wealth is unevenly distributed. There will not be many skewed keys, just as a society has few rich people and few unusual people, so tmp1 will have very few records. Map join tmp1 with users to produce tmp2, and read tmp2 into the distributed file cache. This is a map process.

3. The map reads both users and log. If a record comes from log, check whether its user_id is in tmp2; if so, write it to a local file a; otherwise emit a <user_id, value> key-value pair. If the record comes from users, emit a <user_id, value> key-value pair and enter the reduce phase.

4. Finally, merge file a with the output of the reduce phase of stage 3 and write the result to HDFS.
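
A hedged sketch of step 1, assuming a log table with a user_id column (the 1% sampling ratio and the count threshold are arbitrary assumptions):

create table tmp1 as
select user_id, count(1) as cnt
  from log tablesample(bucket 1 out of 100 on rand()) s
  group by user_id
  having count(1) > 1000;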

If the business genuinely requires such skewed logic, consider the following optimizations:

1. For joins, use map join when the small table is confirmed to be no larger than 1 GB.

2. For group by or distinct, set hive.groupby.skewindata=true.

3. Wherever possible, apply the SQL statement adjustments described above.

 
