Reasons and solutions for data skew in Hive

1 Symptoms of Data Skew (Uneven Data Distribution)

  • Task progress stays at 99% (or 100%) for a long time. The task monitoring page shows that only a small number (one or a few) of the reduce subtasks have not finished, because the amount of data they handle differs greatly from that of the other reducers.
  • The number of records processed by a single reducer differs from the average by too much, typically a factor of three or more.
  • The longest-running reducer takes far longer than the average.

2 Reasons for Data Skew

Data skew comes from an uneven key distribution, the characteristics of the business data, problems in table design, or SQL statements that are inherently prone to skew: join, group by, and count distinct (counting after deduplication).

Keyword: situation → consequence

  • Join: one of the tables is small, but its keys are concentrated → the data distributed to one or a few reducers is far higher than the average.
  • Join: a large table joined with another large table, but the join key used to distribute rows has too many 0 or NULL values → all of these null values are handled by a single reducer, which is very slow.
  • group by: the grouping dimension is too small and some values occur very often → the reducer that processes a hot value is often time-consuming.
  • Count Distinct: too many rows share one special value → the reducer that processes the special value is time-consuming.

3 Solutions to Data Skew

3.1 Parameter adjustment:

hive.map.aggr=true

Enables partial aggregation on the map side, equivalent to a Combiner.
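
A minimal sketch of how the setting is used, assuming an illustrative page_view table with a user_id column:

```sql
-- Enable combiner-like partial aggregation on the map side for GROUP BY.
SET hive.map.aggr=true;
-- Optional companion setting: how many input rows a mapper samples before
-- deciding whether map-side aggregation pays off (100000 is the default).
SET hive.groupby.mapaggr.checkinterval=100000;

SELECT user_id, count(1) AS pv
FROM page_view
GROUP BY user_id;
```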

hive.groupby.skewindata=true

Performs load balancing when data is skewed. When this option is set to true, the generated query plan contains two MR jobs. In the first MR job, the map output is randomly distributed among the reducers, and each reducer performs a partial aggregation and emits its result; because rows with the same Group By Key may land on different reducers, the load is balanced. The second MR job then distributes the pre-aggregated results to the reducers by the Group By Key (which guarantees that rows with the same Group By Key go to the same reducer) and completes the final aggregation.
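
A sketch of the two settings combined; with skewindata enabled, EXPLAIN on the same query should show the two-stage plan described above (the page_view table is illustrative):

```sql
SET hive.map.aggr=true;
SET hive.groupby.skewindata=true;

-- EXPLAIN shows the two-job plan: a randomly-partitioned partial
-- aggregation followed by the final aggregation keyed on user_id.
EXPLAIN
SELECT user_id, count(1) AS pv
FROM page_view
GROUP BY user_id;
```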

3.2 SQL statement adjustment:

How to join:

For the choice of driving table, pick the table whose join key is most evenly distributed as the driving table.

Apply column pruning and filtering early, so that the data volume is as small as possible by the time the two tables are joined (see the sketch below).
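
A sketch of column pruning plus early filtering, assuming illustrative orders and users tables; only the needed columns are selected, and rows are filtered inside the subqueries so that less data reaches the join:

```sql
SELECT o.order_id, u.user_name
FROM (
  SELECT order_id, user_id        -- column pruning: only what the join needs
  FROM orders
  WHERE dt = '2021-07-01'         -- filter before joining
) o
JOIN (
  SELECT user_id, user_name
  FROM users
  WHERE status = 'active'         -- filter before joining
) u
ON o.user_id = u.user_id;
```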

Small table Join large table:

Use a map join to load the small dimension table (fewer than 1,000 records) into memory first, completing the join on the map side and avoiding the reduce phase.
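
A sketch of a map join, assuming an illustrative small dim_city dimension table and a large fact_log table; the MAPJOIN hint asks Hive to cache the small table in memory and finish the join in the mappers:

```sql
-- Recent Hive versions convert such joins automatically when the small
-- table is below hive.mapjoin.smalltable.filesize:
SET hive.auto.convert.join=true;

-- Or request the map join explicitly with a hint:
SELECT /*+ MAPJOIN(c) */ l.log_id, c.city_name
FROM fact_log l
JOIN dim_city c
  ON l.city_id = c.city_id;
```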

Large table Join large table:

Convert the NULL keys into strings with a random number appended, which spreads the skewed rows across different reducers. Since a NULL value can never match anything in the join anyway, the final result is not affected by this processing.
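
A sketch of this trick, assuming illustrative log and users tables joined on a string user_id column:

```sql
-- NULL keys are replaced by a unique random string, so they scatter across
-- reducers instead of piling onto one; they still match no row in users,
-- so the join result is unchanged.
SELECT a.*, b.user_name
FROM log a
LEFT JOIN users b
  ON CASE WHEN a.user_id IS NULL
          THEN concat('hive_null_', cast(rand() AS string))
          ELSE a.user_id
     END = b.user_id;
```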

Count distinct with a large number of identical special values:

When computing count distinct, handle the rows whose value is empty separately. If only the count distinct itself is needed, those rows can simply be filtered out, with 1 added to the final result. If other calculations are needed and a group by is required, first process the records with empty values separately, then union the result with the results of the other calculations.
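
A sketch of the filter-and-add-one variant, assuming an illustrative log table in which many rows have an empty user_id:

```sql
-- Empty values are filtered out of the skew-prone COUNT(DISTINCT), then 1
-- is added back so the empty value still counts as one distinct value.
SELECT count(DISTINCT user_id) + 1 AS uv
FROM log
WHERE user_id IS NOT NULL AND user_id <> '';
```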

The group by dimension is too small:

Replace count(distinct) with a group by subquery plus sum() to complete the calculation (see the sketch below).
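
A sketch of the rewrite on an illustrative log table; the inner GROUP BY deduplicates user_id across many reducers, and the outer aggregate only counts the groups:

```sql
-- Equivalent to SELECT count(DISTINCT user_id) FROM log, but the heavy
-- deduplication work is spread over the GROUP BY reducers instead of
-- being funneled through a single reducer.
SELECT sum(1) AS uv
FROM (
  SELECT user_id
  FROM log
  GROUP BY user_id
) t;
```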

Special handling in special circumstances:

When optimizing the business logic yields little benefit, the skewed data can sometimes be taken out and processed separately, and the results unioned back together at the end, as sketched below.
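
A sketch under the assumption that a single hot key, here called 'hot_value', is known in advance in illustrative tables a and b: the normal rows are joined as usual, the hot key is handled with a map join, and the two partial results are unioned back:

```sql
SELECT a.key, a.val, b.info
FROM a JOIN b ON a.key = b.key
WHERE a.key <> 'hot_value'       -- the non-skewed rows take the normal path

UNION ALL

SELECT /*+ MAPJOIN(b) */ a.key, a.val, b.info
FROM a JOIN b ON a.key = b.key
WHERE a.key = 'hot_value';       -- the hot key is handled separately
```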

Origin blog.csdn.net/wilde123/article/details/118785360